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ABSTRACT 


The  work  in  pattern  recognition  and  classification  is  still 
in  its  formative  stages.  The  types  of  classifiers  existing  have  been 
discussed  but  many  of  their  properties  and  capabilities  have  not  been 
investigated.  The  training  of  these  machines  also  requires  more  study 
to  be  fully  understood.  In  this  thesis,  a  review  of  many  of  the  ideas 
in  pattern  classification  has  been  given.  Some  of  these  ideas  have 
been  extended  and  enlarged  upon. 

A  recent  nonparametric  training  method  has  also  been  in¬ 
vestigated.  The  usefulness  of  this  method  is  questioned  in  view  of 
the  results  of  a  simulation  study  discussed  in  this  work.  This  method 
is  also  compared  to  another  recent  nonparametric  training  method. 
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CHAPTER  1  INTRODUCTION 

1.1  Some  Basic  Ideas  in  Pattern  Classification 

The  process  of  learning  as  experienced  by  humans  is  not  fully 
understood  at  the  present  time.  It  is  known,  however,  that  the  ability 
to  learn  is  very  closely  associated  with  the  ability  to  classify  data. 

For  example  if  a  child  is  to  learn  not  to  touch  hot  things  it  must 
classify  ’things  which  it  finds  to  be  hot'  in  the  category  ’things  it 
shouldn't  touch'.  The  data  in  this  case  is  the  description  of  the  hot 
item.  The  child  may  touch  this  item  more  than  once  before  making  the 
correct  mental  classification.  There  exists  a  certain  training  period 
common  to  all  learning.  The  length  of  the  training  period  will  be 
directly  related  to  the  extent  of.  the  consequences  of  misclassif ication, 
namely,  the  burning  sensation  which  accompanies  touching  the  hot  item. 

All  these  facts  plus  many  more  which  are  not  fully  understood  are  inte¬ 
grated  in  some  complex  way  to  produce  learning  in  the  human  mind. 

The  idea  of  building  a  machine  which  is  able  to  learn  has  been 
studied  for  the  past  fifteen  or  twenty  years.  One  method  of  approach  to 
this  problem  has  been  portrayed  in  the  research  into  pattern  recognizers 
and  classifiers.  The  physical  phenomena,  of  which  one  was  mentioned 
above,  has  been  studied  and  an  effort  to  mechanically  or  electronically 
classify  data  has  been  made.  Widrow1*^  and  Huber3  have  published  papers 
describing  machines  which  have  been  successfully  tested  on  classification 
problems.  The  success  of  these  attempts  has  resulted  in  useful  applica¬ 
tions  of  such  machines4*5. 

To  formulate  the  problem  mathematically  we  must  first  define 
the  physical  phenomena  using  mathematical  terminology.  Before  classi¬ 
fication  can  begin,  a  set  of  measurements  of  the  significant  properties 
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of  the  object  or  phenomena  to  be  classified  must  be  obtained.  (Deciding 
what  constitutes  a  significant  property  is  in  itself  a  study  towards 
which  a  vast  amount  of  research  has  been  directed.)  Some  specific 
number,  say  n,  of'  independent  measurements  will  be  made  on  each  item 
to  be  classified.  These  n  measurements  will  be  used  as  the  components 
of  an  n-dimensional  vector  called  the  pattern  vector.  In  this  way  each 
item  considered  can  be  uniquely  associated  with  a  point  in  a  Euclidean 
n-dimensional  space,  En.  This  space,  known  as  the  pattern  or  measure¬ 
ment  space,  is  then  divided  into  appropriate  sets  for  classification  by 
the  use  of  sets  of  functions  known  as  discriminant  or  decision  functions. 
Equating  these  discriminant  functions  to  zero  yields  a  set  of  hyper¬ 
surfaces  in  the  pattern  space  known  as  decision  surfaces.  The  positioning 
of  these  surfaces  is  determined  by  the  discriminant  function  which  is 
a  function  of  the  pattern  vector  together  with  a  set  of  weighting 
variables.  These  weighting  variables  form  a  vector  called  the  weight 
vector.  The  training  process  for  the  machine  consists  of  moving  the 
decision  surfaces  in  En  in  an  effort  to  recognize  or  classify  patterns. 
This  is  done  by  adjusting  the  weight  vectors. 

As  was  stated  previously,  there  exist  consequences  for  mis- 
classification  which  in  many  eases  will  vary  with  the  category  from 
which  the  pattern  arises.  This  phenomenon  is  called  the  loss  or  cost 
function,  since  the  consequences  of  misclassif ication  can  be  considered 
as  a  loss  incurred  by  wrongly  classifying  a  pattern  of  one  category  as 
one  belonging  to  another  category. 

1.2  Scope  of  the  Thesis 


The  aim  of  this  thesis  will  be  two-fold.  First  a  general 
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description  of  several  types  of  learning  machines  and  some  theorems 
relating  to  their  existence  and  use  will  be  given.  This  will  be  followed 
by  a  discussion  of  training  methods  for  these  machines.  The  reasons  for 
the  inclusion  of  this  part  are  to  give  a  brief  view  of  the  present  state- 
of-the-art  in  this  area  and  also  to  verify  and  extend  some  of  the  available 
theorems.  The  second  part  of  the  thesis  will  include  a  particular  pattern 

classification  problem  taken  from  a  recent  paper.  The  methods  used  in 
this  paper  to  attack  the  problem  will  be  stated  and  analyzed,  and  a  com¬ 
parison  of  these  methods  will  be  given.  It  is  hoped  that  this  discussion 
will  provide  some  insight  into  the  problems  that  have  to  be  faced  by 
researchers  in  the  area  of  pattern  recognition  and  classification. 

Chapter  two  includes  a  discussion  of  various  types  of  classifiers. 
The  first  four  types  of  machines  mentioned  receive  only  brief  explanations. 
The  reader  who  is  interested  in- these  types  of  classifiers  can  find  very 
comprehensive  discussions  on  them  in  the  references  and  also  in  the  many 
other  papers  available  on  the  subject.  The  section  on  layered  machines, 
however,  is  a  little  more  comprehensive.  This  is  because  the  current 
literature  has  not,  in  general,  included  this  type  of  classifier.  The 
results  presented  in  section  2.5  are  for  the  most  part  original  and  it  is 
felt  that  they  show  the  usefulness  of  such  a  classifier. 

Chapter  three  attempts  to  give  a  general  view  of  both  the  para¬ 
metric  and  the  nonparametric  training  methods.  Chapter  four  gives  the 
results  of  a  simulation  study  of  a  classification  problem  using  two  of 
these  algorithms. 
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CHAPTER  2  TYPES  OF  PATTERN  CLASSIFIERS 

2.1  Introduction 

The  qualitative  description  of  pattern  classification  can  be 
translated  into  mathematical  terms  by  using  set  theoretic  concepts.  Let 

A  ,  A  ,  ...  A  denote  the  sets  of  measurements  or  properties  relating  to 
1  2  R 

R  classes  or  categories  of  patterns  in  the  n-dimensional  Euclidean  space 

from  which  the  pattern  vectors  are  taken-  Then  the  pattern  space,  En, 

can  be  considered  to  be  the  union  of  the  pattern  classes.  That  is: 

En=AUAUAU  ...  UA  (2.1) 

12  3  R 


The  basic  requirement  for  pattern  recognition  is  that  the  sets 


A  ,  A  ,  . . . ,  A  must  be  non-overlapping  or  their  intersection  must  be  an 
12  R 

empty  or  null  set.  That  is: 


A  HA  HA  0...n  A  =  0  (2.2) 

12  3  R 


Equation  2.2  may  be  valid  either  in  the  probabilistic  or  deterministic 
sense.  By  probabilistic  sense  we  mean  AHA  =0  with  a  probability 

1  j 

approaching  one  but  not  necessarily  equal  to  one. 

The  next  step  is  to  attempt  to  separate  these  pattern  classes. 

This  can  be  done  by  a  set  of  functions  f  (X) ;  i  =  1,2,  ...,  R;  known  as 

discriminant  functions  and  having  the  property 

f  (X)  >  f  (X)  for  X  e  A  (i,j  =  1,2,  . . . ,  R;  i  f  j) 
i  J  i 

(2.3) 

These  functions  can  take  on  one  of  a  number  of  forms;  linear,  piecewise 
linear,  polynomial,  etc.  The  types  of  classifiers  are  named  after  the 
types  of  discriminant  functions  which  they  simulate.  The  various  types 
of  classifiers  will  be  discussed  in  the  sections  that  follow. 
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2.2  Linear  Classifiers 


Of  all  classifiers  the  linear  classifier  is  the  one  which  has 
been  analyzed  most  thoroughly.  A  linear  classifier  can  be  described  as  an 
implementation  of  a  set  of  n-dimensional  hyperplanes.  The  mathematical 
analysis  of  a  linear  classifier  is  by  far  the  most  straight  forward  and  the 
implementation  of  such  a  machine  can  be  achieved  using  very  inexpensive  but 
reliable  networks. 

Such  a  machine  is  depicted  in  Figure  2.1  where  the  necessary 
components  are  a  set  of  weighting  and  summing  devices.  A  resistive  adder 
network  would  be  sufficient  for  such  a  job. 

Consider  a  two-dimensional  pattern  space  containing  two  cate¬ 
gories,  A  and  A  .  A  linear  classifier  can  be  made  which  assigns  an 
1  2 

unclassified  pattern  X  to  category  A  if  W^X  >0  and  to  A  otherwise. 

i  1  i  1 

The  discriminant  function  in  this  case  is 

f(X)  =  Wfc*X  =  w1x^  +  w  x  (2.3) 


where 

X  = 

‘xi 

and 

W  = 

_wr 

*2 

w 

L  2j 

and  superscript  t  denotes  transposition.  The  symbol  0  represents  a 
threshold  value  for  classification.  This  threshold  value  is  made  a  part 
of  the  discriminant  function  by  many  authors  by  increasing  the  dimension 
of  the  weight  vector  by  one  and  augmenting  the  pattern  vector  with  a  term 
equal  to  one  for  all  patterns,  i.e. 


X1 

«r 

X  = 

x2 

and  W  = 

w2 

1 

w_ 

L  3j 

The  criterion  for  classification  will  then  be  X±  e  A^  if  f(X^)  >  0  and 

X  e  A  otherwise,  with  f(X  )  being  defined  in  terms  of  these  new  pattern 
i  2  i 
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Discriminators 


Fig.  2.1  A  linear  machine 
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and  weight  vectors.  This  latter  approach  will  be  assumed  in  the  remainder 


of  this  work. 

Increasing  the  number  of  measurements  on  the  item  to  be  classi¬ 
fied  to  n  increases  the  dimension  of  X  to  n+1.  The  new  decision  surface 
defined  by  f(X)  =  0  will  be  an  n-dimensional  hyperplane.  Increasing 
the  number  of  categories  from  which  the  patterns  may  arise  to  R  changes 

the  classification  rule  to  X  e  A.  for  f . (X.  )  >  f . (X,  )  (j  =  1,  2,  ....  R; 

k  i  l  k  •  J  k 

j  ^  i)  .  The  decision  surfaces  defined  by  f^X)  -  f  (X)  =  0  where  and 

A  are  adjoining  sets  in  the  pattern  space,  make  up  a  set  of  n-dimensional 

hyperplanes.  There  exist  R(R-l)/2  of  these  hyperplanes  of  which  some  may 

be  redundant  (for  example  if  the  two  categories  a  and  b  are  not  contiguous 

the  decision  surface  given  by  f  (X)  -  f  (X)  =  0  is  redundant) . 

a  b 

Now  suppose  we  have  an  n-dimensional  pattern  space  containing 
N  distinct  points.  The  effectiveness  of  a  discriminant  function  can  be 
measured  by  the  number  of  categories  this  type  of  discriminant  function 
can  divide  the  pattern  space  into.  For  a  linear  function  there  exist 
L(N,n)  distinct  divisions  (known  as  linear  dichotomies)  of  the  pattern 
space  provided  no  n+1  points  lie  on  an  n-1  dimensional  hyperplane. 

Nilsson7  has  shown  this  number  to  be 


L(N,n)  =  2 


n  /  N-1  \ 

I  V  i  / 


=  2 


i=0 

N 


for  N  >  n 


for  N  <  n 


(2.4) 


where 


'N-1 


is  the  binomial  coefficient  (N-1) ! / (N-l-i) ! i ! .  This  can  be 


N 

compared  with  2  which  is  the  total  number  of  possible  dichotomies  of  N 
points . 

Highleyman8  states  that  the  optimal  decision  function  for  the 
following  two  class  problems  are  linear. 

1)  The  two  classes  are  equally  probable  a  priori,  have  equal 
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losses  associated  with  misrecognition,  and  have  probability 
distributions  over  the  measurement  space  which  are  unimodal, 
spherically  symmetrical,  and  identical  except  for  a 
displacement  of  modes. 

2)  The  two  classes  are  equally  probable  a  priori,  have  equal 
losses  associated  with  misrecognition,  and  have  probability 
distributions  over  the  measurement  space  which  are 
Gaussian  and  which  have  equal  covariance  matrices. 

3)  The  convex  hulls  of  the  points  in  measurement  space  con¬ 
tained  in  each  pattern  class  are  nonintersecting. 

These  conditions  are,  however,  more  restrictive  than  is  necessary.  The 
conditions  of  equal  probability  and  equal  losses  in  (1)  and  (2)  only 
serve  to  vary  the  threshold  0  and  so  inequality  of  either  the  probabilities 
or  losses  or  both  would  still  give  a  linear  solution. 

One  very  important  application  of  linear  classifiers  is  the 
piecewise  linear  machine.  Such  a  machine  can  be  used  in  situations  where 
the  minimum  distance  between  a  pattern  and  a  category  is  the  criterion  for 
classification.  This  distance  is  usually  defined  in  terms  of  a  norm  in  the 
pattern  space. 

The  discriminant  function  for  a  piecewise  linear  classifier 

is  given  by 

f  (X)  =  max  {g.j(X)}  (i  =  1,  ...»  R) 

1  j  =  1,  ....  L  1  (2.5) 

where  g  J (X) ,  called  a  subsidiary  discriminant  function,  is  a  linear 
i 

function  of  X.  (g.J(X)  =  W  X) .  Such  a  machine  is  shown  in  Figure  2.2. 

While  the  discriminant  function  of  this  machine  is  linear  for  all  X  the 
addition  of  a  second  maximum  selector  results  in  the  machine  response 
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Discriminator  ] 


Fig. 2.2  A  piecewise  linear  machine 
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suddenly  changing  from  one  subsidiary  function  to  another.  The  decision 

surfaces  will  therefore  be  made  up  of  portions  of  hyperplanes  in  the 

pattern  space.  These  hyperplanes  will  be  connected  but  their  directions 

are  completely  arbitrary.  As  a  result  the  categories  described  by  these 

discriminant  functions  are  not,  in  general,  convex. 

Let  P. ,  . . . ,  P  be  R  point  sets  in  En  and  let  POP  =  0 
1  R  i  j 

(i^j).  Let  us  suppose  that  we  wish  to  synthesize  a  machine  which  will 

put  a  pattern  X  e  E*1  in  a  category  P  if  the  distance  between  X  and  P 

i  i 

is  minimum.  We  will  show  that  a  piecewise  linear  machine  will  do  the 

job.  Let  the  distance  between  X  and  P  be  given  by  d(X,P.).  We  will 

i  i 

conclude  X  e  P,  if 

k 


d(X,P  )  <  d(X,P  ) 
k  3 


(jtfO 


Defining  d(X,Y)  =  j X— Y |  gives 


d(X,P  )  -  inf  I X-p I 

1  pep. 

i 

d2(X,P.)  =  inf  (X-p, X-p) 

1  P£P. 

l 

=  2  sup  {(p-X)-Js(X*X)-Js(p'p)} 
peP 

i 

Since  X  is  a  fixed  pattern,  it  is  sufficient  to  consider  only 

sup  { (p'X)-^(p'p) } 

pcPi 

We  can  now  define  our  discriminant  function  as 

f  (X)  =  sup  { (p#x)-%(p‘p) } 

1  P^. 

l 

However,  if  P  is  a  point  set  of  continuum  this  will  give  an  infinite 
i 

(nondenumerable)  family  of  discriminant  functions  and  thus  will  not  be 

of  any  practical  use.  But,  since  P^O  P,  =  0  for  all  i,  a  countable  set 

of  points  P.  can  be  chosen  to  represent  the  category  i,  such  that 
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FIGURE  2.3  AN  EXAMPLE  OF  MINIMUM  DISTANCE  CLASSIFICATION 


L 

U1  N  (P.  )  3  P. 

k=l  e  xk  1 


where  N  (P  )  is  an  e  neighborhood  of  P  e  P  .  And  so  we  have 
e  i,  i.  i 

k  k 

f  (X)  =  max  g  (X,P.  ) 

k  =  1 ,  . . . ,  Li  1k 

where 

g.  (X,P.  )  =  (P.  -X)  -  %(P  -P.  ) 

l,  i,  l,  i,  i. 


(2.6) 


g  is  a  linear  function  of  X  and  W  (the  P  will  form  the  components 

ik  Xk 

w^  of  W) .  This  is  of  the  same  form  as  equation  2.5  which  is  the  dis¬ 
criminant  function  of  a  piecewise  linear  machine.  An  example  of  a  two- 
dimensional  pattern  space  divided  into  two  categories  by  piecewise 
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linear  functions  is  shown  in  Figure  2.3.  We  notice  from  here  that  the 
resulting  categories  need  not  be  convex  sets. 

One  practical  example  of  minimum  distance  classification 
arises  in  weather  forecasting.  The  weather  pattern  for  a  given  location 
and  time  may  be  caused  by  either  the  moving  in  of  conditions  prevailing 
elsewhere  or  by  the  particular  geological  surroundings.  These  causes 
would  be  greatly  separated  in  the  measurement  space  but  the  same  weather 
conditions  may  result  from  either  cause.  Since  it  is  the  area  immediately 
around  each  point  in  the  pattern  space  that  makes  up  the  particular 
category,  an  averaging  of  these  points  to  produce  one  central  point  would 
not,  in  general,  produce  a  satisfactory  categorization.  In  such  cases, 
the  minimum  distance  concept  of  classification  would  produce  the  best 
result . 

The  minimum  distance  classifier  is  a  very  effective  classifier 
if  the  position  of  the  categories  in  the  pattern  space  is  known  a  priori. 
However,  if  this  knowledge  is  unavailable  the  classification  problem 
becomes  very  difficult  due  to  the  difficulties  of  training  a  piecewise 
linear  machine.  Such  a  training  procedure  requires  not  only  an  adjustment 
of  the  weight  values  but  also  an  adjustment  in  the  number  of  subsidiary 
discriminant  functions  in  each  discriminator.  As  yet  no  one  has  come  up 
with  such  a  training  method  and  so  this  problem  would  pose  a  very  inter¬ 
esting  research  project  for  anyone  interested. 

2.3  Quadric  Classifiers 

The  linear  machines  just  discussed  were  a  special  case  of 
polynomial  machines  whose  study  comes  under  the  heading  $  machines. 
However,  one  special  case  of  polynomial  machines,  namely,  the  quadric 
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machine,  can  produce  an  optimal  solution  for  an  important  class  of 
problems.  For  this  reason  we  shall  take  a  look  at  it  separately  and 
this  will  also  give  an  introduction  to  $  machines. 

The  quadric  discriminant  function  takes  the  form 


n 


f„(X)  =  I  w^x^2  +  l  l  w41.x  x  +  l  w4x4  +  ~n+1 

(2.7) 


n-1  n 


n 


j  =  l  33  3  j  =  1  k=j+l  j=l  J  J 


w 


which  can  be  written  in  matrix  notation  as 

f  (X)  =  X^.X  +  XtB  +  C. 


(2.8) 


The  decision  surfaces  obtainable  with  a  quadric  machine 
are  sections  of  second-degree  surfaces.  These  include  hyperellipsoids 
for  A  positive  definite  which  form  hyperspheres  if  A_  is  a  diagonal 
matrix,  hyperellipsoidal  cylinders  for  positive  semidefinite  and 
hyperhyperboloids  for  other  types  of  A^. 

If  equation  2.7  were  written  out  in  full  it  would  take  the 


form 


2  2 

f,(X)  =  w,  x,  +  w  x^+...+w  xx  +  w^  x  + 
i  11  1  22  2  12  1  2  23  2  3 


+...+WX  +  w„x„  +  . . .  +  w  , . 

11  22  n+1 


Suppose  a  mapping  F  exists  such  that  F:  En+Em  (m>n)  in 


(2.9) 


such  a  way  that 


2  2 

F(X)  ~  (x  ,x  ,  ...,  xx, x  x  ,  ...,  x  x  j  ...,  1) 
12  1223  12 

(2.10) 


We  can  now  write 


f  (X)  =  W1tF(X) 


(2.11) 


where 


b/ifc  x-1*4  3*  q»r  d*  ^ool  b  IlBfie  aw  nosBSi  eifl3  10!  .amsldoiq 


,3  +  a^x  +  x  . 

r  Jioaql  J  i93fli  il  sbuloni  s  :3i\ "  ■  rbob^ius  9Bi§t>b-boooae  io  enoJ:J393  i^B 


bn6  9dJ:nl3t9bJ;m9a  svidieoq  A  to!  axsbfiXI^o  Isbioaql  L9i^qxrf 

.  A  5to  89qx^  ,39d3o  io3  a,  <ioIodi9qxdi»qx» 

urtoi 


+  ,.X  X  W  +  x  x  w  +  - +  '  x  w  + 

£  I  £1 


w  +  ...  +  _x„w  4*  x  v  -f  ...  + 


(1  «...  ,  .  c  «...  «  X  X.^XjX  « . 


(01. S) 


(X)’?  ;  W  -  (>  )  5 


\A/  ^  . ,v  “  - y  /  , 
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W  and  F(X)  will  contain 


m  =  n  +  n(n-l)/2  +  n+l 
=  (n2  +  3n  +  2)/2 


(2.12) 


terms  where  n  is  the  dimension  of  the  pattern  vector.  If  a  processor 
were  to  be  constructed  whose  output  was  the  vector  F (X)  for  an  input  of 
the  pattern  vector  X,  the  quadric  classifier  could  be  built  by  connecting 
the  output  of  such  a  processor  to  the  input  of  an  m-dimensional  linear 
machine.  The  processor  would  consist  of  a  set  of  multipliers  which, 
although  not.  as  economical  or  reliable  as  the  network  making  up  the 
linear  classifier,  are  readily  available. 


A  quadric  classifier  is  able  to  produce  the  optimal  solu¬ 


tion  for  a  two  category  problem  in  which  the  patterns  within  the  cate¬ 
gories  have  a  Gaussian  distribution.  Problem  (2)  on  page  8  is  a  special 
case  of  this  general  problem.  The  actual  derivation  of  this  discriminant 
function  will  be  given  in  Chapter  3.  Since  many  of  the  natural  observa¬ 
tions  which  one  would  wish  to  categorize  can  be  approximated  by  a 
Gaussian  distribution  the  quadric  machine  is  a  very  important  one. 

2.4  $  Classifiers 

As  we  have  seen  in  the  last  section  the  quadric  machine 
can  be  built  by  first  building  a  processor  to  produce  F(X)  and  then 
using  an  m-dimensional  linear  machine  for  classification.  Extending 
this  idea,  suppose  F:  E^E111  (m>n)  such  that 


•  09 


(2.13) 


nlBJnoo  IIlw  (X)1  bns  W 


srfu  qir  gnidBnt  ^iowJ9a  sriJ  se  sldBilsi  10  l&olmoaooa  86  3on  dguoridlfi 


-edS3  .nrix  niri;  i\*  <  nsJJBa  9 if 2  rf  •  d  r  n  :  m  >Idcnc  Yv  '  3  own  t  jo  :  rtohi 
Islosqa  £  ei  8  9gsq  no  (  )  roaldoi  croi.tu  iUiib  nr.  se^  >  .  av  ,  c. 
n  sniar’  >e  >  airfn  2o  no.  i  vit  b  Ib.-jjob  arIT  .mail  Ls-brrg  ?  n  k  wo 

.£  LOdqsriO  na  n9\  Lg  9d  Xw  nolJonul 
b  ^d  b9d6£Uxon[qq6  9d  hbo  9s1jc  >9363  od  it  iw  blucw  9/io  rfolrfw  anoix 
.9no  XfiBmoqrai  ^9v  b  ai  9nxri.6m  3X'  ;p  9r  n  noXXu  incaaXb  nsiaaosO 


aiailiae  »XD  $  A.  £ 


i  oildE  jp  9r  J  j  o  no;  ■<  .»•  £i  erfa  jjj  l.  n  .  . 

narij  bns  (X)1?  SDoboiq  od  loaaaooiq  6  gniblXud  r'axii  \(d  Jlxod  ad  n  ,D 


15 


where  F^(X)  are  polynomial  functions  of  X.  A  processor  can  be  built 

to  produce  such  an  F(X)  from  the  pattern  vector  X  and  its  output  can  be 

connected  to  the  input  of  a  linear  machine.  The  resulting  discriminant 

function  is  a  linear  function  of  the  weight  vector  elements,  w. ,  and 

can  be  written  in  the  form 

$(X)  =  w.F  (X)  +  w  F  (X)  +  ...  +  w  F  (X)  (2.14) 

■Li  l  l  mm 

Such  a  machine  is  called  a  $  machine  and  is  shown  in  Figure  2.4.  If 

F  is  the  identity  operator,  F(X)  =  X,  the  resulting  machine  is  linear. 

If  F  maps  X  into  F(X)  of  the  form  x  x  (i,j  =  0 ,  1 ,  . . . ,  n)  (x  =1) 

i  j  0 

the  resulting  machine  is  quadric.  If  F  maps  X  into  F(X)  of  the  form 

xxx  ...  Xi  (k  >  ^  k  —  0,  1,  2,  .a.,  n) 

k.  k0  k  k  12  r 

12  3  r 

the  resulting  machine  is  an  r’th  order  polynomial  machine.  The  proces¬ 
sor  in  all  cases  consists  of  a  set  of  multipliers. 

These  polynomial  machines  will  require  a  processor  and  need 
more  weighting  devices  than  the  linear  machine  and,  therefore,  will  be 
less  economical.  The  main  justification  for  their  development  and  use 
lies  in  their  increased  ability  to  classify.  From  equation  2.4  we  know 
the  number  of  linear  dichotomies  of  N  points  is 

L(N,n)  =  2  \  *  j  for  N>n 

i=0  V  i  ! 

=  2N  for  N^n 

In  the  case  of  $  machines  we  have  a  linear  function  of  the  weight  vector 
in  a  higher  dimensional  space.  This  means  we  are  attempting  to  separate 
the  mapped  categories  in  Em  by  linear  functions.  But  equation  2.4  gives 
the  number  of  linear  dichotomies  of  N  points  in  a  space  of  specified 


dimensions . 


Therefore,  it  follows  that  the  number,  say  P,  of  distinct 
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divisions  of  the  pattern  space  by  a  $  function  will  be  the  same  as  the 
number  of  linear  dichotomies  of  the  mapped,  En  space  i.e. 


P(N,n)  =  L(N,m)  =  2  J  fN_1  ) 


for  N>m 
for  Nlm 


(2.15) 


where  m>n.  So  the  increased  complexity  may  be  justified  by  the  in¬ 
creased  classification  ability.  As  before  this  result  is  based  on  the 
assumption  that  no  n+1  points  lie  on  an  n-1  dimensional  hyperplane. 

Since  the  dimension  of  this  mapped  space  is  a  function  of 
the  degree  of  the  polynomials  F_^(X),  the  number  of  dichotomies  of  this 
space  will  also  be  a  function  of  this  degree.  Suppose  the  degree  of 

each  F.(X)  is  r,  i.e.  F,(X)  is  of  the  form 
l  i 


F .  (X)  =  x  x  x  .  o  .  x 
1  kl  k2  k3  kr 


k  =0,  1,  2,  ...  n 
i 


where  we  define  x  =  lo  There  are  n+1  variables  x.  which  we  wish  to 

0  J 

group  in  distinguishable  arrangements  of  the  size  r,  each  arrangement 
forming  one  dimension.  From  Feller6  this  would  result  in  m  =^n+r  !  • 
In  the  case  of  r  =  2  (quadric  classifier) 

m  =  (n+2)(n+l)/2 
=  (n2+3n+2) / 2 


which  is  the  same  as  equation  2.12. 

The  maximum  degree  of  F0(X)  will  be  r  =  n  which  gives 

l  max 

m  =  (2n)!/(n!)2.  This  means  the  upper  limit  of  the  summation  in 
max 

equation  2.15  will  be  much  greater  than  the  corresponding  limit  in 
equation  2.4.  Therefore,  provided  N>m  the  ratio  P(N,n)/L(N,n)  will  be 
large.  In  practical  applications  N  often  approaches  infinity  (i.e. 
the  pattern  space  is  a  continuum)  and  so  we  see  the  classification 
ability  of  the  polynomial  machine  will  greatly  exceed  that  of  a  linear 
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machine . 

2.5  Layered  Machines 

As  an  extension  of  the  ideas  presented  in  the  last  section, 
one  could  consider  the  possibility  of  producing  a  classifier  by  simply 
interconnecting  a  number  of  linear  classifiers  in  any  one  of  a  number 
of  ways.  The  output  from  a  group  of  linear  machines  can  be  used  as  the 
input  to  another  linear  machine.  The  analysis  of  such  an  interconnected 
network  of  linear  machines  is,  in  general,  very  difficult  since  no  one 
has  yet  come  up  with  a  transfer  function  representation  of  a  linear 
machine . 

If  we  restrict  the  interconnection  of  these  linear  machines 
such  that  only  connections  between  layers  of  linear  machines  are  allowed 
with  no  connections  between  separate  machines  within  these  layers 
occurring,  we  can  derive  some  interesting  results.  Under  such  restric¬ 
tions  the  outputs  of  one  layer  of  linear  classifiers  are  used  as  an 
input  to  the  next  layer.  These  machines  are  known  as  layered  machines, 
an  example  of  which  is  shown  in  Figure  2.5.  To  simplify  our  analysis 
of  such  a  machine  it  will  be  assumed  in  the  work  that  follows  that  the 
pattern  space  contains  only  two  categories  which  are  to  be  dichotomized. 
This  means  the  +1  or  -1  output  of  the  final  layer  is  the  response  of  the 
entire  machine. 

From  the  diagram  we  see  that  the  input  to  the  first  layer 
is  the  pattern  vector  while  the  input  to  each  subsequent  layer  is  the 
output  of  the  previous  layer.  This  output,  which  can  be  considered  a 
vector,  contains  elements  restricted  to  take  on  the  values  of  +1  and  -1. 
If  we  assume  that  there  exist  d  linear  machines  in  the  first  layer  and 
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the  output  of  this  layer  is  F  (X) ,  then 

f\x)  =  {F  1  (X)  ,  F„hx),  ....  F  hx)} 

1  2  d 

where 

F  ^ (X)  =  sgn  (W  X.x) 
i  i 

and  sgn  represents  the  signum  function.  Similar  expressions  exist  for 

the  other  layers.  This  first  layer  therefore  produces  a  mapping  of  the 

pattern  space  into  the  verticies  of  a  d, -dimensional  hypercube  with  the 

origin  situated  in  the  centre  of  the  cube.  In  general,  the  j’th  layer 

with  an  input  from  the  i’th  layer,  will  produce  a  mapping  from  a  d  - 

i 

dimensional  hypercube  to  a  d „ -dimensional  hypercube.  The  final  layer 
maps  the  response  of  the  previous  layer  into  the  vertices  of  a  one¬ 
dimensional  cube,  namely,  two  points  on  a  line.  Since  each  element  of 
the  layered  machine  is  a  linear  machine  the  resulting  divisions  of  the 
pattern  space  must  be  linear. 

The  question  then  arises  as  to  how  many  layers  are  required 

to  dichotomize  the  pattern  space  or  whether  there  even  exists  a  finite 

number  of  layers  which  will  successfully  complete  the  dichotomization? 

Let  us  consider  the  second  question  first. 

Suppose  the  pattern  space  contains  two  nonintersecting 

subsets,  A  and  A  ;  E  =  A  UA  and  AH  A  =  0.  Define  a  d-dimensional 
1  2  12  12 

hypercube  by  I^>  I^c:  E  -  As  before  we  consider 


where 


FP  (X)  =  (F  P(X),  ...,  F  P(X)} 


F  P (X)  =  sgn  (W.P-FP  L(X)) 


i  =  { zeld  :  z  =  F  (X) ,  XeA  } ; 

1  t>  1 


Let 
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Z  =  {  zel  :  z  =  F  (X) ,  XeA  } 
1  d  2 


We  then  have  the  following  theorem: 


Theorem  2.1:  If  two  categories  and  A^  exist  in  the 

pattern  space  such  that  A^flA^  =  0  and  there  exists  some 

mapping  B:  En->-l^  then  there  exists  a  finite  number  p  of 

layers  in  a  layered  machine  which  will  produce  a  linear 

dichotomy  of  and  Z^  i.e.  CoZ^OCoZ^  =  0,  where  Z^  and 

Z  are  as  previously  defined,  and  MCo"  means  convex  hull. 

2 

Proof:  We  want  to  know  if  there  exists  p<°°  such  that  CoZ^H  CoZ^  =  0 

when  A  fl  A  =  0.  Suppose  there  does  not  exist  any  such  finite  p.  Then 
lim  (CoZ  H  CoZ  )  ^  0.  This  means  there  exists  a  £  for  all  p<°°  such 

p->°°  12  0 

that  Zq£ (CoZ^O  CoZ^) ,  which  implies  Z^eCoZ^,  Z^eCoZ^  for  all  p.  But 

because  each  Z  in  Z^  and  Z^  is  a  binary  vector,  the  points  making  up 

the  sets  are  discrete  so  that  Z  eCoZ  implies  Z  eZ  and  Z  eCoZ  implies 

0  1  0  l  02 

Z  eZ  .  Since  each  F(X)  is  a  single  valued  function  there  must  exist  an 
0  2 

X  such  that  F  (X  )  =  Z  for  X  eA  and  F  (X  )  =  Z  for  X  eA  ;  which 

0  p  0  0  01  P0  0  02 

implies  X^eA^HA^,  meaning  kf\k^  ±  0.  But  this  contradicts  the  original 

premise  that  A  Pi  A  =  0.  Therefore,  there  must  exist  a  p<°°.  Q.E.D. 

1  2 

We  now  state  a  theorem  which  we  can  use  to  establish  how 

many  layers  are  needed  to  dichotomize  a  pattern  space. 

Theorem  2.2:  If  there  exist  m  hyperplanes  which  form  a 

nonredundant  partition  of  the  two  finite  distinct  point 

sets  A  and  A  ,  a  sufficient  condition  that  CoZ  H  CoZ  =  0 
12  12 

is  that  exactly  m+1  cells  formed  by  the  partition  be 
occupied  by  elements  of  A^  and  A^ ,  where 

Z^  =  {zelm  :  z  =  F (X) ,  XeA^  and  Z ^  =  {zd™  :  z  =  F(X),  XeA2 } 
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and  F(X)  and  Im  are  as  defined  in  Theorem  (2.1),  These 
point  sets  can  then  be  separated  by  a  two-layer  machine. 

The  term  nonredundant  partition  here  means  a  partition  with 
the  property  that  if  one  of  the  separating  hyperplanes  is  removed,  at 
least  two  nonempty  cells  will  merge  into  one  cell.  A  proof  of  this 
theorem  can  be  found  in  Nilsson7. 

If  the  d  partitioning  hyperplanes  of  the  pattern  space  were 
all  parallel  it  would  follow  that  there  were  exactly  d+l  nonempty  cells. 
If  we  restricted  the  subsets  of  the  pattern  space  to  a  finite  number  of 
distinct  points  we  could  further  say  that  these  two  subsets  can  be  non- 
redundantly  partitioned  by  a  set  of  parallel  hyperplanes.  This  follows 
from  the  fact  that  there  exist  only  a  finite  number  of  lines  which  join 
all  these  points  but  there  exist  an  infinite  number  of  directions  in  En« 
To  partition  these  points  by  a  set  of  parallel  hyperplanes  we  need  only 
choose  a  direction  for  these  hyperplanes  which  is  different  from  any  of 
the  directions  of  the  lines  joining  all  these  points.  It  can  therefore 
be  said  that  a  pattern  space  containing  a  distinct  finite  number  of 
pattern  vectors  contained  in  two  subsets  can  be  dichotomized  by  a  two- 
layer  machine. 

We  have  already  restricted  our  consideration  to  pattern 
spaces  for  which  the  mapping  B:  En+Id  exists.  But,  provided  d<«,  the 
verticies  of  the  hypercube  1^  form  two  subsets  and  made  up  of  a 
distinct  finite  number  of  points  where  Y^  corresponds  to  the  mapping 
of  elements  from  and  Y to  the  mapping  of  elements  from  k^.  From 
Theorem  (2.2)  we  can  therefore  say  that  a  two  layer  machine  will  dicho¬ 
tomize  the  subsets  Y  and  Y ^  in  Id.  We  have  therefore  proved  the 


theorem: 
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Theorem  2.3:  Any  pattern  space  containing  two  subsets  for 

n  ci 

which  we  can  find  B:  E  ->-1  (d<°°)  can  be  dichotomized  by  a 

three  layer  machine,  the  first  layer  being  the  mapping  B. 

So  far  we  have  assumed  that  each  element  of  the  layered 
machine  is  a  linear  classifier.  In  such  a  case,  for  the  mapping  B  to 
exist  there  must  exist  a  set  of  hyperplanes  which  divide  the  pattern 
space  into  subsets  each  containing  only  members  from  one  category. 

As  has  already  been  implied  in  the  work  in  this  section,  the 
discriminant  function  for  a  layered  machine  employing  only  linear  class¬ 
ifiers  is  piecewise  linear.  A  derivation  of  this  discriminant  function 
although  important  is  not  particularly  enlightening  and  so  is  not 
included.  Such  a  derivation  can  be  found  in  Nilsson7. 

There  seems  to  be  no  reason  for  restricting  the  elements  of 
a  layered  machine  to  linear  classifiers.  In  particular  if  the  first 
layer  were  a  set  of  $  machines,  a  much  more  complex  discriminant  function 
would  result  and  B  could  be  made  to  map  any  bounded  boundary.  To  the 
best  of  the  author's  knowledge  this  problem  has  never  been  investigated, 
although  it  would  appear  to  have  great  possibilities.  Another  problem 
with  layered  machines  which  has  not  been  solved  is  the  problem  of 
training.  The  availability  of  several  layers,  each  of  which  contains 
several  elements,  naturally  increases  the  complexity  of  any  training 
method.  There  will  be  a  greatly  increased  number  of  weights  to  adjust 
as  well  as  an  increased  number  of  ways  of  adjusting  them.  For  anyone 
interested,  the  work  of  layered  machines  offers  a  number  of  possible 
avenues  of  research. 
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CHAPTER  3  TRAINING  OF  PATTERN  CLASSIFIERS 

3.1  Introduction 

In  this  chapter,  we  will  consider  the  problem  of  training  a 
pattern  classifier  by  the  use  of  a  set  of  pattern  vectors  known  as  the 
training  sample.  This  consists  of  varying  the  elements  of  the  weight 
vector  W  until  a  stage  is  reached  where  any  further  change  in  W  would 
increase  the  number  of  errors  given  by  this  classification  procedure. 

In  section  2,1,  it  was  stated  that  the  pattern  space  can  be 
separated  into  appropriate  pattern  classes  by  the  set  of  discriminant 
functions  f . (X)  (i  =  1,  2,  R)  if  this  set  of  functions  exist  such 


that  f_^(X)>f#(X)  for  XeA^  (i,  j  =  1 ,  2 ,  ,  .  .  ,  R;  i  f  j )  .  In  the  dis¬ 
cussion  in  Chapter  2  on  the  various  types  of  classifiers,  it  was  found 


that  the  discriminant  function  f . (X)  is  always  of  the  general  form 
f.(X)  =  (W*$(X))  where  $(X)  is  some  function  of  the  pattern  vector.  It 


l 

should  be  noted  that  throughout  this  discussion  the  function  $(X)  is 
assumed  to  take  on  many  different  forms  varying  from  a  linear  to  an  r'th 
order  polynomial  function  of  X,  but  f±(X)  is  always  a  linear  function 
of  the  weight  vector  W. 

There  are  two  general  types  of  training  methods,  the  para¬ 
metric  and  the  nonparametric  methods.  In  many  classification  problems, 
the  pattern  classes  are  defined  by  sets  of  parameters.  For  instance, 
when  the  patterns  are  random  variables,  each  class  is  specified  by  some 
distinct  probability  function.  The  parametric  training  method  consists 
of  estimating  the  values  of  the  parameters  defining  the  sets  to  be 
classified  from  a  training  sample  and  determining  a  suitable  discriminant 
function  based  on  these  parameters  and  the  values  of  the  loss  functions. 
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However,  the  type  of  probability  distribution  of  pattern  vectors  in  the 
pattern  space  may  not  be  known.  A  different  type  of  training,  a  non- 
parametric  method,  is  required  for  this  type  of  problem.  In  this  case, 
one  of  several  methods  can  be  used  to  search  for  an  appropriate  discrim¬ 
inant  function  based  on  a  pre-established  risk  function.  The  training 
procedure  then  uses  this  search  technique  to  determine  a  weight  vector 
through  an  iterative  procedure  which  produces  a  minimum  of  the  risk 
function. 

3.2  Parametric  Training  Methods 

To  begin  our  analysis  of  training  methods,  we  must  first 
define  our  cost  function.  Let  C(a/3)  (=0)  be  the  cost  of  misclassifying 

a  pattern  from  the  class  3  by  a  decision  which  places  it  in  class  a. 
(C(a/a)  =  0}.  The  value  of  this  must  be  known  a  priori  but  in  many 
cases  a  rough  estimate  of  its  relative  value  is  sufficient.  The  optimum 
discriminant  function  will  then  be  defined  as  the  one  which  minimizes  the 
cost  of  misclassif ication. 

Suppose,  as  was  done  in  Chapter  2,  that  the  pattern  space, 

En,  consists  of  R  categories  A  (i  =  1,  ...,  R).  That  is 

i 

En  =  A  UA  U  . . o  UA  (3.1) 

1  2  R 

We  can  derive  a  discriminant  function  which  will  divide  the  pattern 
space  into  these  R  categories  by  using  conditional  probabilities.  Let 
the  probability  distribution  of  pattern  vectors  in  category  A^  be  given 
by  p(X/i).  For  the  parametric  training  procedure,  we  either  assume  or 
try  to  approximate  the  form  of  the  density  function.  In  some  cases  it 
may  even  be  known.  To  fully  establish  the  statistics  of  the  classifi¬ 
cation  problem,  we  must  determine  the  probabilities  q^(i  =  1?  • ° • ,  R) 
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of  each  pattern  class  and  the  parameters  which  specify  each  p(X/i)  from 
the  training  samples  given. 

Now  suppose  we  are  given  a  pattern  vector  X.  The  probability 
that  it  comes  from  some  particular  category  A  is  p(j/X)  i.e.  the 

j 

conditional  probability  of  category  A  given  X,  The  cost  of  such  a 

classification  is  C(j/i)  where  A  is  the  category  to  which  X  belongs, 

i 

The  average  expected  cost  will  then  be 

R 

K(i)  =  l  C ( j / i)  p(j/X)  (3.2) 

j  =  l 

The  pattern  X  will  be  assumed  to  belong  to  the  A^  for  which  K(i)  is  a 
minimum.  In  this  way,  the  average  cost  of  misclassif ication  is  mini¬ 
mized. 


By  use  of  Bayes  rule  we  can  write 

p(X/j) 

p(j/X)  = - q  (3.3) 

p(x)  3 

) 

where  q  is  the  probability  of  occurrence  of  category  j,  p (X)  is  the 

j 

probability  of  the  pattern  X  occurring  and  p(X/j)  is  the  conditional 
probability  of  X  given  that  it  belongs  to  category  j.  Equation  3.2 
can  now  be  rewritten  as 

R 

K(i)  =  { 1/p (X) }  l  C(j/i)  p(X/j)  q.  (3.4) 

j  =  l  3 

Since  p(X)  will  be  common  to  all  K(i)  (i  =  1,  2,  ...,  R)  it  can  be 
removed.  We  can  then  define  our  discriminant  function  as 


A  R 

f  (X)  =  L 

1  j  =  l 


C(j/i)  p(X/j)  q 


(3.5) 


with  classification  taking  place  according  to  the  rule 

f  (X) >f  (X)  implies  XeA,  (i,j  =  1,  2,  ...,  R;  i^j) 

i  j  3 
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From  this  we  see  that,  assuming  C(j/i)  is  given  for  all  i  and  j,  all 
that  is  needed  is  an  estimate  of  each  p(X/j)  and  the  probabilities  q  . 

A  machine  employing  such  a  discriminant  function  is  often  called  a 
Baye's  Machine.  Anderson^  has  shown  that  a  Baye's  machine  yields  the 
optimal  solution  in  that  it  minimizes  the  the  cost  of  misclassif ication. 

Consider  the  case  when  the  training  sample  contains  N. 
pattern  vectors  from  category  A^  for  each  i.  Since  the  training 
sample  is  supposed  to  represent  the  pattern  space  sufficiently  for 
learning,  our  best  estimate  of  q  is 

R 

q.  -  n./  I  n. 

1  1  j=i  J 

The  density  function  p(X/j),  is  more  difficult  to  find.  Methods  do 
exist,  however,  for  approximating  the  density  and  the  distribution 
functions  from  a  given  set  of  independent  samples  taken  from  the  prob¬ 
ability  space.  One  such  method  is  given  by  Kashyap  and  Blaydon10. 

Other  procedures  suitable  for  classification  problems  are  also  available. 
If,  however,  the  form  of  the  density  function  is  known  or  can  be 
assumed,  the  problem  of  training  is  greatly  simplified  since  we  need 
only  estimate  the  parameters  forming  the  distribution  in  order  to 
specify  p(X/j) . 

One  of  the  most  common  distributions  of  patterns  in  the 
pattern  space  is  the  Gaussian  or  normal  distribution.  The  significant 
parameters  of  this  distribution  are  the  mean  and  variance.  Let  the 
vector  whose  elements  are  the  mean  values  of  the  n  variables  of  the 
distribution  be  given  by  M  and  let  the  matrix  containing  the  variances 
and  covariances  be  £ .  The  normal  distribution  can  then  be  written  in 
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the  form: 

P (X)  =  {  1/2tt  |  Z  |  2}  exp{-^(X-M)tZ~1(X-M)}  (3.6) 

where  |z|  denotes  the  determinant  of  Z.  So  we  have 

p(X/j)  =  {1/2tt|e.  I*2}  exp {“*2 (X-M. )  CL  (X-M,  )  } 

J  J  J  j  (3>7) 

For  the  two  category  case  the  discriminant  functions  become 

f^CX)  =  C ( 1/ 1) { qi/2u | Z ^ |^}  exp{-^(X-M1)tZ1"1 (X-M^) }  + 

,  H  t 

+  C(2/l){q  /2tt|Z  |  }  exp {-^ (X-M  )  2  -1  (X-M  )} 

/  ^  2  2  2 

and 

f2(X)  =  C(l/2){qi/2n|zi|55}  exp{-%(X-M1)tEi-1(X-M  )}  + 

+C(2/2){q/27r|20|^}  exp  {  -%  (X-M,  )  1 1  ^(X-M  )} 

2  2  r  2  2  2 

As  was  mentioned  earlier  there  is  generally  no  loss  connected  with 
correct  classification  (i.e.  C(i/i)  =  0).  Because  we  are  dealing  with 
the  two  category  situation  we  can  reduce  the  discriminant  function  to 

f (X)  =  f2(x)  -  f  (X)  with  classification  taking  place  according  to 
f(X)>0  implies  X  e  and  f(X)<0  implies  X  e  A^.  Because  the  logar¬ 
ithm  is  a  monotonically  increasing  function  of  its  argument,  the  same 
classification  would  result  if  we  were  to  consider  log  f^(X)  instead  of 

f  (X) .  These  simplifications  reduce  the  discriminant  function  to 
i 

f(X)  =  y(X-M2)tE2-1(X-M2)  -  (X-M  ^E  -1  (X-M  )  + 

+  2  log  C(l/2)/C(2/l)  +  log  |Z  | / | S  |  + 

2  1 

+  2  log  q^/q^)  (3.8) 

As  can  be  seen,  this  function  is  a  quadratic  function  of  X  and  is  the 
special  case  mentioned  in  section  2.3.  To  obtain  the  special  case 
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mentioned  in  section  2.2  case  2  we  set  £^  =  Z^  =  E  and  obtain 

f(X)  =  XtZ“1(M  -M  )  +  %(M  *£“1*1  -  M  tZ“lM  )  + 

1  2  2  2  1  1 

+  log  C(l/2)/C(2/l)  +  log  q^q  (3.9) 

which  is  a  linear  function  of  X. 

The  parametric  training  for  the  case  of  a  Gaussian  distribu¬ 
tion  consists  of  estimating  the  values  for  £  ,  M  and  q  for  all  i,  by 

i  i  i 

use  of  the  training  samples.  Anderson  has  shown9  that  if  a  set  of  N 

i 

samples  is  available  from  a  category  having  a  Gaussian  distribution  of 
pattern  vectors  the  maximum  liklihood  estimate  of  the  mean  is 

N. 

<M  >  =  { 1/N  >  (3.10) 

1  a=l  a 

and  the  maximum  liklihood  estimate  of  the  variance  would  be 

Ni 

<£ .  >  =  U/N  }  J1  (X  -  <M.  >)  (X  -  <M  >)Z  (3.11) 

i  i  ,  a  i  a  i 

a=  1 

assuming  that  the  measurements  in  the  pattern  space  were  independent. 

The  values  of  q  can  be  estimated  as  previously  shown.  Improvements  on 

i 

these  estimates  can  be  made  if  the  knowledge  of  the  distribution  is 

increased.  For  example,  if  the  measurements  are  not  independent,  i.e. 

the  covariance  terms  in  Z  are  non-zero,  Anderson  shows  an  improved 

i 

method  of  solving  for  the  matrix  Z  .  If  the  variance-covariance  matrix 

i 

is  known  a  priori  an  improved  estimate  of  the  mean  vector  can  be  obtained 
as  shown  by  Nilsson7. 

In  his  work  in  pattern  recognition  Cooper1 1  proposes  a 
probability  distribution  given  by 
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p  (X)  =  {mr(n/2)/2r(n/m)7T^n|  Z  1^}  expMX-M)^  1  (X-M) 
m 

(3.12) 

where  n  is  the  number  of  dimensions  of  the  pattern  space  and  T  is  the 

gamma  function.  This  reduces  to  the  Gaussian  distribution  for  m  =  2. 

The  advantage  of  using  this  distribution  over  the  Gaussian  distribution 

can  be  seen  from  the  work  done  by  Ahmed12  on  this  function.  Ahmed  proves 

the  following  result.  As  nr*30  p  (X)  converges  to 

m 

p(x)  =  r(%n  +  1)  /  \z\W^n  for  all  X  e  S 

=  0  for  all  X  t  S  (3.13) 

where 

S  =  {XeEn :  / (X-M) C£_ 1 (X-M)  <1} 

From  this  it  can  be  seen  that  by  increasing  m  the  distri¬ 
bution  can  be  made  to  drop  off  to  zero  much  quicker  than  the  Gaussian 
distribution.  Such  a  probability  function  will  be  of  great  use  for 
describing  the  separation  of  categories  in  the  pattern  space  when  it  is 
known  with  great  certainty  that  the  patterns  cannot  lie  outside  a 
certain  set. 

Beginning  with  equation  3.5 

R 

f  (X)  =  V  C(j/i)p(X/j)q. 

i  .  i  J 

J  =  1 

a  discriminant  function  for  this  distribution  can  be  derived  in  the 
same  way  as  for  the  Gaussian  case 

i^n 

p(X/j)  =  {m  r  (n/2)  /  2T  (n/m, )  tt  2  |z.|}  * 

F  J  j  J  J 

x  exp{-(X-M.)tz._1(X-M.)}^mj  (3.14) 

J  J  J 

If  we  once  again  limit  our  discussion  to  the  two  category  cases  with 
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C(i/i)  0  and  use  the  form  f(X)  =  log  f^.(X)  -  log  f^(X)  with  classi¬ 
fication  according  to  f(X)<0  for  X  e  and  f(X)>0  for  X  £  A^  we  get 

f(X)  =  ^m2(X"M2^tz2"1(X"M2)  “  J5mi(x“M1)t21’1(X-M1>  + 
+  log  C ( 1  / 2 )  /  C ( 2 / 1 )  +  log  m^/m2  1°8  q^/q  + 

+  log  r (n/m^)  /  r(u/m^)  +  ^log  |l2|/|z  |  (3.15) 


As  before  we  can  see  the  optimal  solution  for  this  distribution  will 

be  quadratic  in  X  and  if  E  =  the  resulting  function  will  be  linear 

1  ‘ 

in  X. 


The  training  required  for  this  distribution  will  differ  from 


the  Gaussian  case  only  in  that  the  values  for  m,  and  m  must  be  estab- 

1  2 

lished. 


3.3  Nonparametric  Training  Methods 

In  the  study  of  nonparametric  training,  we  consider  classi¬ 
fication  problems  in  which  an  a  priori  knowledge  of  the  distribution  of 
pattern  vectors  in  the  pattern  space  is  unnecessary.  To  train  machines 
for  classification  under  such  conditions,  we  have  only  the  knowledge 
which  can  be  gained  from  the  training  samples  to  guide  us.  This  type 
of  training  is  probably  more  important  than  the  parametric  method 
because  the  parametric  method  begins  by  assuming  some  distribution  and, 
consequently,  the  resulting  classification  cannot  be  any  more  accurate 
than  the  assumption. 

We  will  assume  in  this  section  that  a  set  of  pattern  vectors 
making  up  the  training  sample  is  available  and  the  correct  classifi¬ 
cation  for  each  vector  in  this  set  is  also  available.  These  vectors 
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will  be  invariant  and  so  we  will  be  concerned  with  variations  in  the 
weight  vector  and  the  space  from  which  this  vector  comes.  We  will  not  be 
interested  in  the  pattern  space.  The  Euclidean  space,  En+1'\  which 
contains  all  the  weight  vectors,  may  be  thought  of  as  a  dual  space  to 
the  augmented  pattern  space.  We  will  call  this  space  the  weight  space. 

In  the  work  that  follows,  the  pattern  space  will  also  be  assumed  to  be 
(n+1) -dimensional ,  i.e.  the  pattern  vectors  will  all  be  assumed  to  be 
augmented  as  explained  at  the  beginning  of  Chapter  two. 

As  mentioned  in  equation  2.2,  for  classification  to  have 
any  physical  meaning,  it  is  required  that 

a  n  a  n  ...  ra  =0 

1  2  R 

In  this  section,  we  will  assume  that  this  condition  exists.  This  will 
enable  us  to  assume  the  existence  of  a  weight  vector  which  will  opti¬ 
mally  dichotomize  the  pattern  space  provided  it  is  used  in  conjunction 
with  a  proper  $  processor.  Let  the  set  of  all  these  solution  weight 
vectors  be  given  by 

4-1  * 

V  =  {WeEn  :  (w  6F(X))  >  (W.-F(X))  for  all 

1  1 

X  e  A  ;  i  =  1 ,  .  .  .  ,  R} 
i 

The  problem  of  training  is  then  equivalent  to  choosing  a  W  in  V. 

To  begin  the  training  process,  an  arbitrary  (n+l)-dimen- 
sional  vector  is  chosen  for  W.  The  vectors  forming  the  training  sample 
are  then  presented  to  the  classifier  one  at  a  time  and  the  response  of 
the  classifier  is  compared  to  the  desired  response.  Each  time  the 
response  is  incorrect  an  adjustment  is  made  on  the  weight  vector.  This 
procedure  is  continued  with  the  training  set  being  presented  as  many 
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times  as  necessary  until  the  classifier  can  correctly  classify  all 
patterns  in  the  training  set.  To  explain  the  procedure  more  fully, 
let  us  consider  a  two  category  case.  Suppose  the  pattern  space  con¬ 
tains  two  categories  A  and  A  with  classification  determined  by 

1  2 

(W'F(X))  >  0  for  X  e  A^  and  (W*F(X))  <  0  for  X  e  A^ •  Consider  the 
first  time  an  error  in  classification  occurs.  We  then  have  the  con¬ 
dition  (W’F(X))  <  0  for  X  e  A^  or  (W*F(X))  >  0  for  X  e  A^ •  A  correction 
must  be  made  in  W  to  produce  a  correct  classification;  in  the  first 
case  W  must  be  increased,  and,  in  the  second  case  W  must  be  decreased. 

It  is  well  known  that  the  direction  of  maximum  increase  of  a  function 
is  along  the  gradient  while  the  direction  of  maximum  decrease  is  along 
the  negative  of  the  gradient.  Solving  for  the  gradient  of  the  dis¬ 
criminant  function  in  the  weight  space  we  get 


V (W* F (X) )  =  (9/9w  ,  8/9w2» 
=  F(X) 


. ..,  9/9w  ) (W* F (X) ) 

n+1 

(3.16) 


This  means  that  no  further  mechanical  calculations  are  required  to  find 

the  gradient  because  F(X)  is  already  available.  This  makes  this  method 

convenient  for  implementation. 

We  now  know  what  the  direction  of  the  correction  vector 

should  be,  we  only  need  to  know  how  large  the  correction  should  be. 

Let  the  fraction  of  F(X)  which  we  use  be  c.  For  an  incorrect  guess  on 

the  k'th  step  of  training  the  correction  will  then  be 

W (k+1)  =  W (k)  ±  cF  (X)  (3.17) 

k 

For  a  correct  guess,  the  weight  vector  remains  unchanged: 

W(k+1)  =  W (k)  (3.18) 


Many  different  suggestions  for  the  value  of  c  have  been  given.  The 
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simpliest  is  to  make  c  some  arbitrary  constant  greater  than  zero.  This 
leads  to  the  fixed  increment  error  correction  procedure.  Another  method 
gives  c  the  value 

c  =  A (w(k) •  F  (X))  /  | Fk(X) | 2  (3.19) 

For  A  =  1,  this  is  the  smallest  number  which  must  be  added  to  W(k)  to 

change  the  sign  of  (W(k)*F.  (X)).  For  this  value  of  A  this  method  is 

k 

called  the  absolute  correction  rule. 

To  prove  that  such  a  training  process  does  converge  to  a 
solution  vector  in  V,  the  following  modifications  are  introduced.  Each 
time  a  pattern  is  presented  and  is  correctly  classified  the  weight 
vector  remains  unchanged.  For  this  reason,  we  will  consider  a  reduced 
training  set  which  contains  only  those  patterns  for  which  the  classifier 
gives  an  incorrect  response.  This  means  that  a  correction  is  made  after 
each  training  pattern  in  this  reduced  set  is  presented  to  the  classi¬ 
fier.  Such  a  presentation  and  correction  is  called  a  step  of  learning. 
One  further  change  is  to  negate  all  vectors  F(X)  derived  from  the 
vectors  X  of  the  training  sample  which  belong  to  category  .  This 
means  (W*F(X))  <  0  will  be  incorrect  for  all  X  and  the  correction 
required  will  be 

W (k+1)  =  W(k)  +  cF (X) 

These  modifications  do  not  change  the  basic  training  procedure,  but 
will  simplify  our  proof  of  convergence. 

We  can  now  say 

W(k+1)  =  W(k)  +  cF  (X) 

K. 

W(k+1)  -  W  =  W(k)  -  W  +  cFk(X) 


so  that 
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where  W  is  a  vector  from  the  solution  region  V.  Multiplying  through 
by  _F  (X)  8ives 

W*F  (X)  -  W(k+1)  *F  (X)  =  W’F  (X)  -  WCkW  (X)  - 
k  k  k  k 

-  c|F  (X) |2 
k 

Since  c  is  positive 

c  I F  (X)  I  2  i  0 
k 

SO 

(W  -  W(k+l))-Fk(X)  ^  (W  -  W(k)>Fk(X) 

Now  let 

F  (X)  =  W  -  W(k+1) 
k 


If  no  such  vector  exists  in  the  training  sequence  we  can  introduce  one 
into  the  sequence.  We  then  have 

(W  -  W(k+1) ) • (W  -  W(k+1))  ^  (W  -  W (k) ) • (W  -  W(k+1)) 


But 


Therefore 


W(k+1)  =  W (k)  +  c (W  -  W(k+1) ) 

=  (W(k)  +  cW}  /  1+c 

(W  -  W(k) ) • (W  -  W(k+1))  =  (W  -W (k) )  x 
x  (W  -  (W(k)  +  cW)  /  1+c) 

=  (1  /  1+c) (W  -  W (k) ) * (W  -  W(k) ) 


Now  c>0  so 

(1  /  1+c)  (W  -  W  (k)  )  *  (W  -  W  (k)  )  <  |W  -  W  (k) 


therefore 

| W  —  W(k+1) j  <  jw  -  W (k) | 

This  shows  that  such  an  error  correction  method  does  converge  to  a 


solution  vector. 


' 


33V Xg  (x;  '  d 


-  (X)  J-U)W  -  (  )  .IV'  -  (X)  * •(!+*)*  -  (x)  ri*w 


gvilJraoq  Bi  a  safllB  - 


.(  ,;)  -  (X)  '  I  : 


3aX  woVl 


3ufl 


)+J  \  {W  3  +  (A/vO 


9"10^.9Z9dT 


((Hof)1,!  •  W)  •  (  (:,.»r  -  W) 

) •  (a)  -  w) (j  i  \  l, 


00 w  -  w|  >  ( (>0W  -  W)‘(U)W  -  W)(o+f  \  I) 


oe  0<o  woH 


sdX 


<  V[;od  a.yob  bori.i  >nr  r  )J  jsti-xod  loi  9  *.'J  i  3^'  >  T 


36 


This  training  procedure  can  be  extended  to  the  multicate¬ 
gory  case  as  well.  In  such  a  case,  there  would  exist  a  set  of  R  weight 
vectors  to  be  adjusted.  Classification  in  this  case  is  determined  by 
x  e  A.  if  F(X)*W^  >  F(X)*W,  (j  =  1,  2,  . . , ,  R;  j  ^  i) .  The  correction 
procedure  in  the  multicategory  case  for  misclassif ication  by  placing 
X  in  category  h  when  it  actually  belongs  to  category  i  at  the  k’th 


step  of  training  is 


W  (k+1)  =  W.  (k)  +  cF.  (X) 
i  lk 

W,  (k+1)  =  W,  (k)  -  cF.  (x) 
h  h  k 

W. (k+1)  =  W.(k)  (j  =  1,  2,  . . . ,  R;  j  ±  h,  j  ±  i) 

3  J 

As  before,  all  the  weight  vectors  remain  unchanged  if  a  pattern  is 
correctly  classified.  In  this  correction  procedure,  c  can  be  defined 
in  the  same  way  as  in  the  single-category  case.  Nilsson7  has  shown 
that  such  an  error-correction  procedure  does  converge  to  a  solution 
weight  vector. 

This  particular  approach  to  error  correction  is  not  a 
unique  approach.  Many  authors  have  given  different  versions  of  this 
same  basic  approach,  the  original  author  being  Rosenblatt13.  Other 
nonparametric  training  methods  have  been  proposed  but  none  is  as  simple 
to  implement.  Two  of  these  methods  will  be  discussed  and  compared  in 
the  following  chapter.  An  excellent  review  of  the  major  training 
methods,  both  parametric  and  nonparametric,  is  given  by  Nagy14  in  a 
very  recent  paper.  In  this  paper,  he  also  gives  a  comparison  of 
these  methods.  The  error-correction  method  just  described  compares 
quite  favorably,  with  the  others. 


a 

oo±*3»nb*  3iiT  .(1  *  C  S*  <  .S  <1  -  t>  V-iW  <  /’(*>*  **  tA  3  X 

■d 

siriJ  lo  anoxa'iav  Pnsxsmb  navis  avfcrf  sxoxiduB  ynsM  .HoBo^qq®  3i/pJ:nu 

n±  b9TBqfliop  bns  beaau  alb  ad  lliw  aboriisn  saarfd  :  ewT 
;;j  .  ;  •  f.  '  ,  ■  r  9  '  '.: 

fi  ni  +J  gi  *,  v  :!  .  qnon  bns  o  L  dawi  I^c  '! 

E9^  3Iuoo  b9cL.*ioa  b  5  2'j  ;  b'->r  i  m  i:3P9TX0 :j  -loixs  ar  *b<dj9  et  d 


37 


CHAPTER  4  COMPUTER  SIMULATION  OF  SOME  TRAINING  METHODS 

4.1  Introduction 

The  current  research  in  the  area  of  training  algorithms  for 
pattern  classifiers  deals  almost  exclusively  with  nonparametric  training. 
For  realistic  pattern  recognition  problems,  nonparametric  training  is 
advantageous  because  it  does  not  require  a  priori  knowledge  of  the  pattern 
sets.  In  practice,  this  knowledge  will  not  likely  be  available.  This 
chapter  includes  a  review  of  two  of  the  more  recent  nonparametric 
training  methods  along  with  the  results  of  a  computer  simulation  of  a 
pattern  recognition  problem  using  them. 

As  was  mentioned  in  section  3.1,  nonparametric  training  pro¬ 
cedures  attempt  to  determine  a  weight  vector,  known  as  the  optimal  weight 
vector,  which  minimizes  the  risk  or  cost  of  misclassif ication.  Without 
a  complete  knowledge  of  the  pattern  classes,  this  optimal  vector  cannot 
be  calculated  directly.  For  this  reason,  a  search  technique  is  intro¬ 
duced  to  find  the  minimum  in  the  weight  space  of  the  cost  function. 

There  exist  many  possibilities  for  deriving  a  training 
algorithm,  primarily  because  the  functional  description  of  the  cost  of 
misclassif ication  can  be  established  in  several  different  ways.  Once 
this  cost  function  is  established,  any  one  of  several  mathematical  search 
techniques  can  be  used  to  search  the  weight  space  for  a  weight  vector  to 
minimize  this  function.  One  of  the  most  popular  search  techniques  is 
the  steepest  descent  method  which  minimizes  a  function  along  the  nega¬ 
tive  of  its  gradient. 


Many  new  and  original  methods  appear  in  the  literature 
almost  continually.  One  completely  new  method  has  been  proposed  by 
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Chien15.  This  method  assumes  neither  any  knowledge  of  the  distribution 
of  pattern  samples  nor  any  knowledge  of  the  number  of  categories  present. 
While  a  critique  of  all  the  methods  that  have  been  proposed  so  far  would 
seem  to  be  desirable,  such  a  study  could  not  be  attempted  due  to  limi¬ 
tations  of  time  and  space.  Instead,  a  comparative  study  of  two  of  the 
algorithms  presented  in  recent  papers  will  be  discussed  in  this  chapter. 

4.2  The  Classification  Problem. 

The  idea  of  a  computer  simulation  of  training  methods  was 
provided  by  a  paper  by  J.D.  Patterson  and  B.F.  Womack"6.  This  paper 
contains  a  qualitative  description  of  a  nonparametric  training  algorithm 
and  the  results  of  a  simulation  study  using  this  training  method  on  four 
separate  classification  problems.  All  the  classification  problems  treated 
in  this  paper  consist  of  two  sets  of  bivariate  normally  distributed 
pattern  vectors  with  different  means  and  variances.  In  checking  the 
results  of  this  paper,  an  ambiguity  was  found  which  will  be  explained 
later.  For  this  reason  it  was  decided  to  study  the  training  problem 
presented  in  the  paper  to  try  to  establish  the  possible  source  of  this 
ambiguity . 

Suppose  that  we  have  two  categories  from  which  the  pattern 
vectors  are  derived.  We  will  call  these  categories  and  „  The 
discriminant  function  will  divide  the  pattern  space  into  two  half  spaces: 

V  =  (X:  f (X)  >0}  V  =  (X:  f(X)  <0} 

The  decision  rule  is:  if  X  e  V  decide  X  e  C^,  if  X  e  decide  X  e  C2 . 
The  cost  of  misclassifying  a  pattern  from  C.^  by  placing  it  in  C  will 

D 

be  given  by  C(8/a),  (a, 8  =  1,  2). 

The  optimum  discriminant  function  for  dichotomizing  a  pattern 
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space  containing  two  categories  of  Gaussian  distributed  pattern  vectors 
can  be  calculated  directly  by  using  equation  3.8.  The  resulting  dis¬ 
criminant  function  is  called  the  Baye’s  discriminant  function  for  a 
Gaussian  distribution.  The  means  and  variances  for  the  four  problems 
are  given  in  Table  4.1. 

In  the  following  work,  the  two  algorithms  used  for  training 
will  be  compared.  One  of  the  most  obvious  ways  of  comparing  the  two 
methods  is  to  compare  the  number  of  errors  each  one  produces  on  a  given 
sample  set  after  training.  The  way  to  produce  the  minimum  number  of 
errors  is  to  set  the  costs  of  misclassif ication  equal  and  the  probability 
of  occurrence  of  the  sample  sets  equal.  Substituting  these  values  into 
equation  3.8  we  get  the  following  discriminant  functions: 

f(X)  =  (-4x  -  x2  +  5)  /  2  (4.1) 

f(X)  =  2.69  -  (3x  2)  /  32  -  2x  (4.2) 

2  1 

f(X)  =  (22.18  -  15xl2  +  12*22)  /  32  (4.3) 

o  2 

f(X)  =  (44.35  -  12x  2  -  3x  )  /  32  (4.4) 

1  ^ 

The  Baye’s  solution  with  C (1/2)  =  C (2/ 1)  and  q  =  q2  produces 
an  equal  probability  of  misclassif ication  from  either  category  for  a 
continuous  distribution.  This  probability  can  be  obtained  by  integrating 
one  distribution  over  the  space  on  which  it  can  produce  a  misclassifi- 
cation  or  taking  the  integral  over  the  space  on  which  it  is  correctly 
classified  and  subtracting  this  from  1.  Choosing  for  the  sake  of  sim¬ 
plicity  the  distribution  of  category  one  we  get 

P (Error)  =  /  p  (X)  dX 

V  1 
2 

-  /  p  (X)  dX 

V  1 
1 
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P (ERROR)  FROM 

P (ERROR)  FROM 

P (ERROR)  FROM 

ESTIMATED 

RESULTS  OF 

EXACT  BAYE'S 

BAYE'S  RULE 

ADAPTIVE 

PROBLEM 

MEANS 

VARIANCES 

RULE  SOLUTION 

SOLUTION 

PROCEDURES 

M  = 

0 

1  0 

E  = 

1 

_0j 

1 

0  4 

1 

M  = 

2 

I  = 

"l  o 

.  1317 

.095 

.085 

2 

2 

2 

_0  4 

M  = 

0 

I  = 

”i  o 

1 

0 

1 

LO  4j 

2 

M  = 

2 

I  = 

"l  0 

.1307 

.  105 

.110 

2 

LoJ 

2 

0  16 

0 

1  0 

M  = 

E  = 

1 

Lo 

1 

0  4 

3 

M  = 

0 

E  = 

~16 

o 

.0989 

.  150 

.230 

2 

la 

2 

0 

1 

fo 

~i  o 

M  = 

E  = 

1 

Lo. 

1 

0  4 

4 

.  1574 

.220 

.270 

[0 

[4  0 

M  = 

E  = 

2 

0 
_ _ 

2 

0  16 

TABLE  4.1  SUMMARY  OF  PATTERSON  AND  WOMACK’S  RESULTS  TOGETHER  WITH  THE 
PROBABILITY  OF  ERROR  {P (ERROR)}  FROM  THE  EXACT  BAYE’S  RULE 
SOLUTION 
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After  calculating  this  integral  for  the  four  cases*,  we  get  the  solutions 
given  under  the  heading  'P (ERROR)  FROM  EXACT  BAYE'S  RULE  SOLUTION'  in 
Table  4.1. 

At  this  point,  we  have  a  disagreement  with  the  authors  of  the 
aforementioned  paper.  In  their  results,  they  have  listed  the  probabilities 
of  misclassif ication  due  to  the  results  of  their  adaptive  training  and 
also  due  to  an  estimated  Baye's  Rule  classification.  These  are  given  in 
Table  4.1.  As  can  be  seen  these  values  differ  from  those  obtained  by 
using  exact  Baye's  rule  solution.  The  results  for  cases  one  and  two 
are  especially  interesting  since  their  estimation  procedures  have  yielded 
probabilities  lower  than  the  theoretically  optimum.  For  this  reason,  we 
decided  to  apply  the  training  algorithm  given  in  their  paper  to  the  first 
categorization  problem  to  determine  the  accuracy  of  the  remainder  of  the 
paper . 

The  training  problem  is  to  find  a  decision  boundary  between 

two  normally-distributed  pattern  sets  which  minimizes  the  classification 

error.  Classification  takes  place  according  to  X  e  C^  if  f (X)  >  0  and 

X  e  C  if  f(X)  <  0.  The  theoretically  optimal  decision  boundary  is  given 
2 

by  setting  f(X)  from  equation  4.1  equal  to  zero.  For  the  simulation 

study,  the  continuous  Gaussian  distributions  are  approximated  by  50 

pattern  vectors  from  each  distribution.  The  method  used  for  obtaining 

a  Gaussian  distribution  of  these  pattern  vectors  is  one  recommended  by 

IBM  for  such  purposes19.  Assume  the  points  z  are  uniformly  distributed 

i 

in  the  interval  (0,1).  An  approximate  normal  function  with  0  mean  and 
a  standard  deviation  of  1  obtained  from  the  z^  is 


*See  Appendix  A  for  detailed  calculations. 
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N 

Y'  =  (  l  z.  -  N/2)  /  /N/12'  (4.6) 

i=l  1 


This  can  be  changed  to  a  normal  function  of  mean  m  and  standard  devia¬ 
tion  a  by  taking 


Y  =  aYT  +  m 


(4.7) 


The  were  obtained  by  using  the  random  number  generator.  After  a 
number  of  trials  N  was  chosen  to  be  179.  To  approximate  a  uniform 
distribution  in  (0,1),  179  numbers  between  0  and  10,000  were  randomly 
chosen  and  each  was  divided  by  10,000.  This  gave  the  z  from  which  the 


l 

Y  was  calculated. 

A  univariate  normal  distribution  has  a  probability  of  .683 

of  the  variable  lying  within  the  standard  deviation  of  the  mean.  For 

2 

a  bivariate  normal  distribution,  this  probability  is  (.683)  or  .4665. 
In  the  sample  patterns  used,  44%  of  category  2  and  48%  of  category  one 
lie  within  the  standard  deviation  of  the  mean.  The  sample  patterns 
seem  to  be  fairly  accurate  representations  of  the  normal  distributions 
for  which  they  were  intended. 

One  further  check  was  made  by  applying  the  optimal  discrim¬ 
inant  function  given  by  equation  4.2  to  this  simulated  pattern  space. 
This  resulted  in  fifteen  errors,  eight  from  category  one  and  seven  from 
category  two.  This  means  an  overall  probability  of  error  of  .15  com¬ 
pared  with  .1317  for  the  theoretical  probability. 


4.3  Mean-Square-Error  Classifier 

In  this  section,  we  will  discuss  the  training  algorithm 
proposed  by  Patterson  and  Womack16.  In  their  paper,  they  have  presented 
the  criterion  for  classification  and  given  the  results  of  a  simulation 
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study.  The  search  technique  used,  which  they  have  called  'pattern 
search',  is  mentioned  but  not  discussed  in  their  paper.  For  this  reason, 
the  steepest  descent  search  technique  is  applied  to  the  criterion  given 
in  the  paper.  This  algorithm  will  then  be  applied  to  the  first  classi¬ 
fication  problem  given  in  section  4.2. 

The  criterion  for  classification  can  be  established  by  con¬ 
sidering  the  discriminant  function  as  a  mapping  from  the  pattern  space 
to  the  real  line  which  may  be  considered  the  'decision  space.'  In 


performing  this  mapping  it  is  desirable  to  map  the  points  from  C  as 
near  as  possible  to  some  point  which  is  a  finite  distance  from  the 
point  to  which  the  points  from  map.  Since  we  desire  f (X)  >  0 
for  X  e  C^,  the  constant  will  be  positive  and  similarily  we  see 
will  be  negative.  To  weight  the  classification  according  to  the  pre¬ 
determined  costs  of  misclassif ication,  let  K  =  C ( 2 / 1 )  and  K  =  — C ( 1 / 2 ) 

1  2 

One  method  of  measuring  the  effectiveness  of  a  discriminant  function 


would  then  be  to  measure  the  error  produced  in  this  mapping.  For  this, 

we. use  the  mean-square-error  M: 

- 2~ ( 1)  - y-(2) 

M  =  {f(X)  -  C(2/l)r  +  { f (X)  +  C(l/2)r  (4.8) 

where  denotes  averaging  over  category  i.  Taking  an  estimate 

of  the  probability  of  occurrence  of  the  two  populations  to  be  q^  and  q,? 

and  the  number  of  samples  from  each  population  to  be  and  N^,  equation 

4.8  may  be  written  as 

N  2 

M  =  (q  /N  )  Y  {f(X  )  -  C  (2/ 1) }  + 

1  1  c^l 


N2 

+  (q  /N  )  l  {f(X  )  +  C ( 1/2) } 
2  2  [5=1  6 


2 


(4.9) 
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The  steepest  descent  method  will  be  applied  to  the  mean- 
square-error  criterion  to  develop  a  search  algorithm.  The  mean-square- 
error  is  a  functional  of  the  discriminant  function  which  in  turn  is  a 
function  of  the  pattern  vector  and  the  weight  vector.  The  mean-square- 
error  is  changed  by  successive  adjustments  of  the  weight  vector.  To 
minimize  the  mean-square-error  in  a  minimum  of  time,  we  need  to  know 
its  gradient,  VM,  since  the  negative  of  the  gradient  will  give  the 
direction  of  maximum  decrease  or  steepest  descent.  If  a  series  of 
infinitesimal  steps  were  taken  in  the  direction  of  the  negative  grad¬ 
ient  the  change  in  M  would  always  be  in  the  direction  of  maximum 
decrease.  Since  this  would  require  an  infinite  number  of  steps  to  move 
a  finite  distance,  some  finite  distance  A  must  be  chosen  in  the  direc¬ 
tion  of  -VM.  Denoting  -VM  by  s  we  get 

N1  2 
M(W+As)  =  (q  /N  )  V  { f (W+As  *  X  )  -  C(2/l)}z  + 

1  1  a 


?2 


+  (q  /N  )  l  { f  (W+As  •  X  )  +  C  ( 1  / 2 )  } 

2  2  3=1  3  (4.10) 


Solving  for  the  gradient*  we  get 


s 


x  C (2/ 1) }  + 


N 

+  (q  /N  ){(  f  A(X  )) 
2  2  6=1  8 


x  W  + 


N9 

Y2 

Z 

3=1 


F(V 


X  C (1/2)  } 
(4.11) 


where  F(X)  is  the  augmented  $  function  of  the  pattern  vector  and  A(X) 
is  the  matrix  made  by  taking  the  outer  product  of  F(X)  with  itself. 


*See  Appendix  B  for  calculations. 
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Now  let  us  define 


N 

A  =  (q/N  )  V1{f(s-X  )[f(W-X  )  -  C(2/l)3  } 
1  1  1  “  “ 


N. 


A,  =  (q-/N  )  p{f(s-X  )Cf(W-X  )  +  C(l/2)]  } 

2  2  2  6-1  6  P 

In  terms  of  these  new  variables,  the  minimum  of  M(W  +  As)  is  given  by 


* 


A 


N1  2 

=  -Uj  +  y  /  <yy  i  f(s-v  + 

a=  1 


N  ? 

+  (q/N  )  f (s  *X  ) 

2  2  ^  6 


(4.12) 


These  results  can  now  be  used  for  nonparametric  training 
of  a  pattern  classifier.  The  various  steps  are: 

1)  The  samples  together  with  their  correct  classi¬ 

fications  are  given  to  the  classifier  which  makes  the 
indicated  calculations  on  these  vectors  and  produces 

a  value  for  A  and  s. 

2)  These  values  are  then  used  to  change  the  weight  vector 
W  to  a  new  value  of  W  +  As. 

3)  The  process  is  then  repeated  using  the  samples 

each  time  until  the  mean-square-error  is  near  enough 
to  zero. 

A  schematic  of  this  scheme  is  given  in  Figure  4.1.  For  a  definition 
of  the  terms  used  see  the  calculations  in  the  appendix. 

Because  the  distributions  of  pattern  vectors  to  be  classi¬ 
fied  are  Gaussian,  a  quadratic  $  processor  is  used  in  the  simulation 
study.  This  also  allows  a  comparison  of  the  simulation  results  with 


*See  Appendix  B  for  calculations. 
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the  results  given  by  Patterson  and  Womack  as  they  also  used  a  quadratic 
discriminant  function.  This  gives  the  discriminant  function  the  general 
form: 


f  (X) 


2  ,  2 
W  X  +  W  X 
11  2  2 


+  VlX2  +  Vl  +  W5X2  +  W( 


(4.13) 


No  a  priori  knowledge  of  the  distributions  is  assumed  by  the  weight 
vector  and  so  its  elements  are  all  set  equal  to  one  at  the  outset  of 
the  training  process.  The  cost  functions  are  assumed  to  be  equal  in 
section  4.2  and  are  assumed  to  take  on  some  constant  value  so  they  are 
arbitrarily  set  equal  to  one.  The  system  is  then  simulated  on  an 
IBM  APL\360  system. 

The  training  process  as  given  on  page  45  is  repeated  one 
hundred  times  and  the  mean-square-error  decreased  for  each  iteration. 
The  change  in  the  mean-square-error  is  found  to  be  from  M  =  597.6  for 
W  =  1,1, 1,1, 1,1  to  M  =  1.089  after  one  hundred  iterations.  The  weight 
vector  at  this  point  is 


W  =  -.2253,  -.03314,  -.06276,  .2987,  .09026,  .8023 
A  graph  of  the  mean-square-error  and  the  number  of  errors  produced  by 
the  weight  vector  at  the  end  of  each  iteration  is  shown  in  Figure  4.2. 

The  number  of  errors  caused  by  the  discriminant  function 
derived  from  the  weight  vector  after  training  is  36,  28  from  category 
two  and  8  from  category  one.  The  probability  of  error  arising  from 
the  use  of  this  function  on  a  pattern  space  containing  continuous 
Gaussian  distributions  can  be  calculated  in  a  similar  way  to  the  cal¬ 
culations  given  in  section  4.2.  One  major  difference,  however,  is 
.that  the  probabilities  of  misclassif ication  from  the  two  categories 
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NUMBER  OF  ERRORS 
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are  not  equal  and  so  the  integral  must  be  solved  for  each  distribution. 
The  probability  of  error  due  to  misclassif ication  of  patterns  from 
category  two  will  be 


a  -(x.-2)2/2  f,(x  )  -(x9-2)2/8 

P  =  q  {  ( 1  / 4tt )  /  e  /  e 

VW 


dx  dx  } 
2  1 


(4.14) 


where  f  (x  )  and  f  (x  )  can  be  obtained  by  solving  the  discriminant 
function  for  x^  at  the  decision  boundary,  and  a^  and  a^  are  the 
extreme  values  of  the  decision  boundary  in  the  x^  direction. 

The  probability  of  error  due  to  misclassif ication  of 
patterns  from  category  one  can  be  solved  by  taking  the  integral  of 
its  probability  distribution  outside  category  one.  However,  it  is 


much  simplier  to  take  the  integral  over  all  space  and  subtract  the 
integral  inside  this  category.  Since  the  integral  of  a  probability 


density  function  over  all  space  is  one  we  get 


a9  -xi  / 2  f 9 (x  )  -x9 

p  =  q  {1  -  ( 1  / 4tt )  /  e  /  e 

vai  vW 


2/8 


dx  dx  } 
2  1 


(4.15) 


The  probability  of  error  when  using  this  discriminant  function  on 
continuous  Gaussian  distributions  will  then  be  P  =  P  +  V For  the 
example  considered,  P  is  found  to  be* 

P  =  .0743  +  .2925  =  .3668 


This  is  approximately  the  same  conclusion  as  that  resulting  from  the 
sample  set.  A  graph  showing  the  points  from  each  category  as  well  as 
the  decision  boundaries  produced  by  Baye's  Rule  and  by  the  training 
process  is  given  in  Figure  4.3. 


*See  Appendix  C  for  calculations. 
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4.4  Probabilistic  -  Descent  Classifier 

In  this  section,  a  nonparametric  training  method  proposed 
by  Amari17  is  discussed.  The  criterion  used  in  this  training  procedure 
is  the  probability  of  an  error  in  classification  multiplied  by  the  cost 
of  such  an  error.  Amari  calls  this  "the  average  risk  of  misclassif ica- 
tion".  Using  the  terminology  of  the  preceding  section,  this  criterion 
can  be  defined  in  terms  of  the  weight  vector  W  as  follows: 

R(W)  =  /  q  p  (X)C(2/l)dX  +  /  q  p  (X)C(l/2)dX  (4.17) 

V  2  2  V  1  1 

1  2 

The  objective  of  the  training  procedure  is  to  find  that  W  which  mini¬ 
mizes  R(W).  This  can  be  achieved  by  setting  the  gradient  of  R(W)  in 
the  weight  space  equal  to  zero. 

VR(W)  =0  V  =  (9/9w  ,  9/9w  ,  . ..,  9/9w  ,  9/9w  ) 

12  n  n+1 

(4.18) 


This  gradient  is  very  difficult  to  obtain.  However,  it  has  been  given 
by  Amari  in  his  paper  as 


VR(W)  =  (1/llWli)  /x{q  d  (X)C(2/1)  -  q  p  (X)C(l/2)  }dX  + 

D  2  2  11 

V{ q  p  (X) C (2/ 1) }dX  +  /  V { q xp x (X) C(l/2)}dX  =  0 


(4.19) 


where 


W II 


n  9  % 

=  (  T  w  )  and  D  is  the  decision  boundary.  Because  the 
i=l  1 


values  of  P  (X)  and  P  (X)  are  assumed  to  be  unknown  this  equation  can- 
1  2 

not  be  solved  for  W.  Even  if  these  probabilities  were  known  the  solu¬ 
tion  would  in  general  be  very  difficult  to  obtain. 

If  we  now  use  the  values  given  in  section  4.3  for  the  para¬ 
meters  in  equation  4.19,  the  last  two  integrals  are  zero,  since  q  ,  q2 , 
C ( 1/2)  and  C(2/l)  are  constants  and  p^X)  and  p2(X)  are  functions  only 


xe 

laiiiaaslJ  xnaoi  ad  -  aidaxi idsdod^  ^*A 

basoqoiq  boddam  gninifidd  o  1 1 X surras qno/i  r  ,noid99?  aidd  nl 
.  t  .;.yc  '  i  gninii  aind  nr  baan  no >1  ui*/:j  v IT  b  no  <b  r 
;  go  add  ^t!  bailqxdluni  rroidsoxiig  rsJ  >  ii  dorr  9  ns  o  <dl  id  adodc;  9<  i  ax 

nolzaJiio  aidd  tnoidaae  gnibaoadq  aril  Xo  v;goIon±flrd9d  arid  gnie  7  • "noid 

:awoIIoX  as  W  no3x  jv  if  tew  rl  o  aoidad  ni  banilab  ad  nso 

(U.A)  Xb(£\I)0(X)  q  p  \  +  Xb(I\S)D(X)  q  p  -  (W)« 

-inira  rfoiriw  W  dsril  bniX  od  ai  adi/baaodq  gnirrisdd  aril  Xo  svidoat<f0  sdT 
nx  (W).H  Xo  dn9ibsdg  9rfd  gnxddae  ^d  bavai  or  »d  nso  axriT 

•  079s  od  Isupa  99sqe  drigiaw  arid 


;6\6  <  w6\6  t...  .  wG\6  , fw6\G)  *  V 

s  isqsq  aid  ni  iitmA 


(I\S)3(X)  qtp1x\  (W\I)  “  (W)s» 

S  I 

(ex.*) 

j  i-i 


*  ji  W  ,!•  ad9rfw 


-nso  toidsups  aidd  tpaorAau  a  .  od  b9mira?  3  if  X)  H  *  s  ,  )  ^  X  :  >  ax; 
-uioi  sdi  nwond  adaw  aaxdxlidsdcn  q  sard  1.  i  •  1  ic  :  b.v.Ioa  id  don 

.n  sddo  od  din  iJ  .  tb  \;x>  3d  isisnsg  c:  ■  iluov  noid 
-si  iq  9rid  doX  C.A  roxdoaa  ni  i*  a  "  -sv  arid  aau  on  s  r  II 

<f  >  9onia  ( oi  as  9  is  aisdgadm  ov  d  ia>  I  add  «  .  noid£  i  ax  dm 


4 


52 


of  the  pattern  vector  X.  The  first  integral  is,  however,  much  more 
difficult  to  work  with  than  the  following  two.  For  this  reason,  Amari 
assumes  the  cost  function  to  be  a  function  of  the  distance  from  the 
decision  boundary  in  such  a  way  that  its  value  on  the  boundary  is  zero. 
For  this  special  case,  the  first  integral  in  equation  4.19  is  zero  and 
the  optimal  weight  vector  is  given  by  the  solution  to  the  equation 

VR(W)  =  fv{ q  p  (X) C (2/ 1) }dX  +  /  V{q  p  (X)C(l/2)}dX  =  0 
V  2  2  V  1  1 

1  2  (4.20) 


The  only  part  of  the  integrands  of  these  integrals  which  is 
a  function  of  W  will  then  be  the  cost  function  C ( 1 / 2 )  and  C(2/l). 
Therefore 


VR(W)  = 


%! 

ZV 


p  (X)VC(2/l)dX  +  qj  Pi  (X)VC(l/2)dX  =  0 
2  Kr  1 


V, 


(4.21) 


For  a  nonparametric  training  method,  the  weight  vector,  W,  is  changed 
by  an  amount  6W  every  time  a  pattern  vector  is  classified.  In  Amari’ s 
training  method  this  change  in  W  is  given  by 


where 


6W.  =  E  H(X. ,W.) 

l  ii 


(4.22) 


H(X,  W) 


H^(X,  W)  when  f (X)  <  0  and  X  e  A^ 
H^(X,  W)  when  f (X)  >  0  and  X  z  A^ 


0  when  X  is  correctly  classified 

and  E  is  a  positive-definite  matrix.  This  training  procedure  is 
guaranteed  to  converge  if  =  -VC(l/2)  and  =  -VC(2/1).  The  proof 
of  this  statement  is  quite  simple.  We  know  6W  =  EH  for  each  misclassi- 

fied  pattern,  so  the  average  change  in  W  is 
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6W  =  Eq  /  p  (X)  H  dX  +  EqJ  p  (X)  H  dX 
ly  11  V  Z  ^ 

2  1 

=  -Eq  /  p  (X)VC(l/2)dX  -  Eq  /  p  (X)VC(2/l)dX 
V2  V! 

=  -E  Vr(w) 

We  also  know 

3R(W)  /  3W  =  VR 

If  the  values  of  the  elements  of  E  are  sufficiently  small  we  can  approx¬ 
imate  3W  by  6 W.  We  can  then  say,  the  increment  in  the  risk  function  is 

6R(W)  =  6Wfc  VR 

for  a  change  of  6 W  in  W.  Using  the  value  just  solved  for  6W  the 
average  increment  in  the  risk  function  is 

6R(W)  =  6 WC  VR  =  -VRt(W)  E  VR(W) 

Since  E  is  positive  definite 

6R(W)  =  0 

But  R(W)  will  be  positive  since  the  probability  functions  are  always 
positive  and  the  loss  associated  with  misclassif ication  will  always  be 
positive.  Therefore,  the  average  change  in  R(W)  will  cause  a  decrease 
in  the  risk  until  the  optimum  W  is  reached  for  which 

<5R(W)  =  0 

Since  this  decrease  in  the  risk  function  occurs  only  as  an  average  and 
not  on  each  individual  step,  Amari  has  called  this  method  the  probabil¬ 
istic-descent  method. 

Amari  has  further  improved  this  training  theorem  by  incorp¬ 
orating  the  ability  of  the  system  to  learn  the  learning  rule.  This  is 
achieved  by  making  the  values  of  E  adaptable.  The  values  of  the 
elements  of  E  determine  what  size  the  increment  6W  will  be.  When  the 
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weight  vector  is  far  from  being  optimal  it  is  desirable  to  make  W  large. 
However,  as  W  approaches  the  optimal  value,  it  becomes  desirable  to  make 
the  change  small.  For  this  reason,  we  introduce  a  change  in  E  of  AE 
for  every  time  a  change  in  W  occurs.  The  value  of  AE  takes  on  is  given 
by 

AE  =  y  H (X,  W)  Ht(X’ ,  W’) 

when  the  pattern  X'  is  misclassif ied  by  using  the  weight  vector  W' , 
where  X  is  the  previously  misclassif ied  pattern  and  y  is  a  positive 
constant.  Amari  shows  this  change  in  E  will  produce  relatively  large 
changes  in  W  in  the  direction  of  the  gradient  of  the  risk  function  when 
W  is  far  from  optimal,  but  tends  to  reduce  the  absolute  value  of  6W  as 
W  approaches  the  optimal  value.  This  technique  would,  therefore, 
increase  the  accuracy  of  the  training  procedure. 

For  the  cost  function  Amari  suggests  using  the  form 

C  =  |  f  (X)  |  /  ||  W ||  (4.23) 

This  function  will  assign  a  small  cost  to  misclassif ied  patterns  near 
the  decision  boundary  and  much  greater  cost  to  those  far  from  the 
boundary.  However,  we  assumed  at  the  outset  of  the  present  discussion 
that  a  constant  loss  would  be  attributed  to  any  misclassif ication.  To 
approximate  such  a  situation  with  a  distance— cos t  function,  the  following 
cost  function  can  be  used: 

C (2/ 1)  =  C ( 1  / 2 )  =  {  |  f  (X)  |  /  ||W||}P  (4.24) 

where  p  <<  1.  Using  this  function,  the  cost  of  misclassif ication  will 
be  close  to  one  for  any  misclassif ied  pattern. 

Having  found  an  appropriate  cost  function,  we  can  now  find 
H(X,  W) .  Since  we  are  considering  misclassif ied  patterns  we  have 
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SO 


Therefore 


VC  ( 1/2)  =  p{-f(X)  /  llwl!)P  1  {(-X  /  ||  W  ||)  -f  (X)  V  (1  /  ||W  ||)  } 


W 1 1  =  (w,  +  w 


2.h 

w  ) 
n 


(4.25) 


v(l  /  Hull)  =  -!s(w  1  +  w22  +  ...  +  w n2)'372  x 


So 


Similarily 


x  (2w^ , 

=  -W  /  ||w 


>  . . . ,  2w  ) 
2  n 

3 


VC (1/2)  =  p{-f (X)  /  ||w||}p_1  {(-X  /  ||W||)  +  (W  /  Hwll3)} 

(4.26) 


C  (2/ 1)  =  {+f (X) }P  / 


P 


so 

VC (2/ 1)  =  p{+f  (X)  /  || W || } P_  1  {(X  /  || W ||)  -  Wf  (X)  /  || W|| 3 } 

(4.27) 

Setting  =  -VC(l/2)  and  =  -VC(2/1),  we  can  now  establish 
the  nonparametric  training  scheme. 

This  training  method  is  then  applied  to  the  same  problem 
mentioned  previously  and  is  simulated  on  the  IBM  API\360  system  using 
the  same  set  of  training  samples.  The  matrix  E  must  be  positive  definite 
and  so  for  simplicity  it  was  initially  set  equal  to  the  identity  matrix. 
The  values  of  H(X,W)  and  H(XT ,  W’),  however,  were  found  to  be  less  than 
one  and  so  y  was  arbitrarily  set  at  .5.  To  choose  an  appropriate  value 
for  p  some  calculations  were  performed.  After  some  trial  iterations, 


{(||  UK  \  I)V  ( m -  (It  w  It  \  X-))  ^<«wlt  V  (X)j-}q  -  (S\i)D  - 


08 


ria-tldadaa  won  nso  aw  ,(I\£)DV-  -  H  bns  (S\J)OV-  -  ,H  *.*>»• 


gfj  rgLi  01938^3  Od£/MA  mi  no  bsiztumls  zl  *>*?.  X  ^oIvb?  osnoi^a j 


,)l3Mbl  arid  OJ  XBI'PS  338  yllBiainl  u  -  Jl  Viollqmia  no’  o  bnt 

1  r  _  /U  VMJ  DCirr  f  RX7  Afi'l 


„Brfn  aaal  ad  od  bnnoi  anaw  .navawod  ,('«  ,'X)H  bns  (W,X)H  lo  asnlav  arfT 
dulBV 


■ 

,,  93sip qonqqa  na  aeoorfa  ol  .£.  da  daa  xllnanlidna  eaw  y  oa  bna  ano 


56 


it  is  found  that  the  value  of  | f (X) j  will  almost  certainly  lie  in  the 
range  10  ^  <  |  f  (X)  |  <  100  and  the  value  of  ||W  ||  will  almost  certainly  lie 
in  the  range  1  <  ||  W  ||  <  10.  This  means  |f(X)|  /  ||W||  will  with  a  very 
high  degree  of  certainty,  lie  in  the  range  10  ^  <  I  f  (X)  |  /  ||w  ||  <  100. 
p  should  now  be  chosen  so  that  C(l/2)  and  C(2/l)  will  be  close  to  one 
for  all  values  of  W  and  X.  A  choice  of  p  =  .05  is  made  which  makes  the 
range  of  the  cost  function  .562  <  C(l/2)  =  C (2 / 1 )  <  1.259  with  C ( 1 / 2 ) 
and  C (2 / 1 )  being  within  25%  of  one  for  almost  all  values  of  W  and  X. 

Using  these  values  for  the  constants  for  the  problem  and 
setting  the  initial  weight  vector  at  W  =  (1 , 1 , 1 , 1 , 1 , 1 , ) ,  the  simulation 
study  is  attempted.  The  results  are  given  by  curve  (1)  of  Figure  4.4.* 

As  can  be  seen  from  the  figure,  the  convergence  rate  is  very 
fast  and  the  weight  vector  approaches  the  optimal  but  the  solution  changes 
dramatically  during  the  16th  step  of  learning.  This  sudden  change  has 
been  traced  in  the  program  and  is  found  to  occur  on  a  misclassif ication 
of  a  pattern  from  category  two.  The  discriminant  function  for  this 
pattern  vector  takes  on  the  value  .0142.  The  weight  vector  at  this  point 
is  W  =  (2.208,  -.552,  -1.194,  3.764,  -1.021,  2.696).  This  gives  the 
norm  a  value  of  ||w||  =  4.673  yielding  a  cost  of  C(2/l)  =  .7484.  However, 


*This  graph  differs  slightly  from  Figure  4.2.  In  the  mean-square-error 
classifier,  the  weight  vector  was  changed  only  after  all  100  patterns 
had  been  received  and  the  appropriate  calculations  made.  In  the  prob¬ 
abilistic-descent  classifier  the  weight  vector  was  changed  each  time  an 
error  in  classification  resulted.  However,  it  would  have  required  far 
more  computer  time  to  stop  after  each  correction  and  check  to  see  how 
many  of  the  100  patterns  would  be  misclassif ied  by  this  new  weight  vector. 
For  this  reason,  the  weight  vector  resulting  after  the  100  patterns  had 
all  been  analyzed  was  the  only  one  checked  for  the  errors  it  produced. 

This  means  the  ’steps  of  learning'  referred  to  in  Figure  4.4  are  not 
the  number  of  times  the  weight  vector  has  been  changed  but  rather  the 
number  of  times  the  100  patterns  have  been  analyzed. 
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STEPS  OF  LEARNING 

FIGURE  4.4  NUMBER  OF  ERRORS  FOR  PROBABILISTIC  - 

DESCENT  CLASSIFIER. 
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h2  =  -VC (2/1)  =  p{f (X)  /  ||w||}P  1  {(X  /  ||W||)  -  Wf (X)  /  ||W||3} 
=  (-11.64,  -35.25,  -20.26,  -5.525,  -9.625,  -2.627) 
and  multiplying  this  by  E  results  in 

6W  =  (-5.901,  -45.218,  -34.878,  5.647,  -22.778,  5.285). 

The  big  numbers  result  from  multiplying  by  {f(X)  /  ||w||}p_1  where 
p-1  =  -.95.  This  means 

{f(X)  /  || W||}p_1  -  { f (X)  /  || w|| }-1 

and  since  f  (X)  /  ||W||  is  small  (approximately  .003)  its  reciprocal  will 
be  very  large.  This  problem  will  result  any  time  a  pattern  is  misclassi- 
fied  by  a  discriminant  function  which  gives  the  result  | f (X) |  <<  1. 

After  a  catastrophic  change  in  W  such  as  this,  it  would 
take  many  steps  of  learning  to  reduce  the  values  of  W  to  reasonable 
proportions  again.  To  overcome  this  problem  it  is  felt  that  increasing 
the  size  of  each  increment  would  allow  the  system  to  overcome  such  a 
problem.  So  the  value  of  y  is  changed  from  .5  to  5  and  the  simulation 
is  begun  again.  The  result  is  shown  by  curve  (2)  of  Figure  4.4.  It 
can  be  seen  that  such  a  change  only  degenerated  the  system. 

The  most  obvious  solution,  although  not  the  most  desirable, 
is  to  ignore  elements  of  the  pattern  sample  which  produce  the  result 
| f (X) |  <<  1.  This  is  done  by  testing  for  | f (X) j  <  .5.  For  pattern 
vectors  yielding  this  value,  no  change  is  made  in  W.  This  technique 
reduces  the  number  of  patterns  available  for  use  in  classification,  y 
is  set  back  to  .5  and  the  simulation  is  tried  again.  A  schematic  of 
the  system  is  given  in  Figure  4.5.  The  results  of  this  test  are  given 
by  curve  (3)  of  Figure  4.4.  As  can  be  seen  the  convergence  rate  has 
been  decreased  by  putting  this  test  in  the  system. 
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Fig.  4.5  PROBABILISTIC  - 
DESCENT  PATTERN  CLASSIFIER 
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However,  the  convergence  rate  is  still  much  superior  to  the  results  of 
the  mean-square-error  classifier. 


As  has  already  been  mentioned,  the  error  rate  given  in  Figure 


4.5  is  the  result  of  testing  only  the  weight  vector  which  results  after 
all  100  samples  have  been  tested.  It  is  also  evident  that  each  successive 
weight  vector  may  not  reduce  the  number  of  errors,  but  the  error  rate 
does  decrease  as  an  average.  For  this  reason,  the  final  weight  vector 


arrived  at  after  the  50  steps  of  learning  will  not  necessarily  be  the 


closest  one  to  the  optimal  solution.  The  solution  vector  for  this 


training  process  is  ,  therefore,  taken  to  be  the  average  vector  of  all 
those  realized  in  the  last  three  steps  of  learning.  This  vector 

W  =  -3.2896,  -.04201,  -.7481,  1.8927,  -1.1685,  5.5026 
which  produces  16  errors,  9  from  category  one  and  7  from  category  two. 

In  the  last  three  steps,  there  are  49  errors  which  averages  out  to  16.3 
errors  in  100  samples.  A  plot  of  the  discriminant  function  using  this 
average  weight  vector  is  given  in  Figure  4.6. 


The  effectiveness  of  this  discriminant  function  when  used 


on  a  continuous  distribution  can  be  measured  in  the  same  way  as  before. 
The  solution  graphed  in  Figure  4.6  is  one  section  of  a  hyperbola.  It 
is  described  by  the  equation 


-3.2896x 
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.04201X 
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2  -  . 7481x  x  +  1.8927x 
1  2 


1 


1.1685x2  + 


+  5.5026  =  0 


(4.28) 


Solving  this  for  x2  gives 


(4.29) 
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-OOPTIMAL  SOLUTION 

-^•SOLUTION  FROM  PROBABI¬ 
LISTIC  DESCENT  TRAINING 
PROCESS 

°  PATTERNS  FROM  CATEGORY  I 
•  PATTERNS  FROM  CATEGORY  2 


FIGURE  4.6  DECISION 

BOUNDARY  FOR  PROBABILISTIC 
-DESCENT  CLASSIFIER. 
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It  also  gives  f^(x^)  which  is  the  same  except  that  a  negative  instead 
of  a  positive  sign  precedes  the  square-root  term.  The  extremities  of 
the  hyperbolic  sections  can  be  determined  by  setting  the  term  under  the 
square-root  sign  equal  to  zero  and  solving  for  x^.  Doing  this  gives 


.00689x^  +  2.06636x^  +  2.29005  =  0 


which  gives 


Xl  =  -1.112  and  x  =  -299.70. 

This  means  the  other  section  of  the  hyperbola  never  goes  more  positive 
in  the  x^  direction  than  approximately  -300  and  so  the  integral  of  the 
probability  functions  inside  this  section  of  the  hyperbola  will  be  zero 
to  five  figures  of  accuracy.  Therefore,  to  deduce  the  probabilities 
it  will  only  be  necessary  to  integrate  inside  the  section  of  the  hyper¬ 
bola  shown  in  the  Figure  4.6.  For  this,  we  can  use  the  form  given  in 
appendix  C,  namely, 


P  =  q  {1  -  (1  /  4.)  f  e^l2  f2/Xl>  e'x22/8  dx  d*. }  + 
1  x  =a,  X  =f , (x, )  2  1 


1  1 


“2  i'  r 


+  a  ( 


1/4,  /2  e-^Xl-2)2  f2/Xl)  e-<x2-2>2/8dx  d*  > 


Vl 


VW 


where  a  =  -1.112,  a  =  00  and  f  (x  )  and  f  (x  )  are  as  given  above. 
[  2  -1-1  2  1 

From  appendix  C  this  expression  can  be  written  as  P  =  P  +  where 
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/  e“^(x l~2)  |erf  [f2  (x  )  -  2//E~\- 

al 


erf  [fx(x  ) 


-  2//8] 


dx^  } 


These  integrals  were  solved  by  computer  methods  to  give 


P  =  .5(1  -  2 . 7281//8tT}  =  .12712 

and 


P2  =  . 5{  .93845/ /8tt}  =  .0936 

and  so 


P  =  P  +  P  =  .2207 
1  2 

This  means  that  this  discriminant  function  could  be  expected  to  classify 
continuous  Gaussian  patterns  with  a  mean  and  variance  as  given  with 
approximately  22%  misclassif ication,  9.3%  from  category  two  and  12.7% 
from  category  one. 


4.5  Comparison  of  the  Two  Systems 

The  two  systems  used  for  training  in  the  last  two  sections 
differ  very  much  in  their  capabilities  for  finding  the  optimal  solution. 
The  mean-square-error  classifier  has  a  smooth  convergence  so  that  every 
change  in  the  weight  vector  is  an  improvement.  The  convergence  rate, 
however,  is  very  slow.  The  reason  for  the  poor  convergence  rate  appears 
to  be  more  the  fault  of  the  performance  criterion  than  the  search  tech¬ 
nique.  For  example,  the  optimal  solution  f (X)  =  -4x^  -x2  +  5  gives 
a  mean-square-error  of  38.45.  The  mean-square-error  during  training 
remains  below  this  value  after  the  third  step  of  learning  and  yet  the 
solution  never  reaches  the  optimal.  The  reason  for  this  is  that  the 
mean-square-error  performance  criterion  assumes  the  discriminant  func¬ 
tion  to  be  a  mapping.  Thus,  multiplying  f (X)  by  an  appropriate  constant 
can  produce  a  far  greater  change  in  mean-square-error  than  a  change  in 
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individual  elements  of  the  weight  vector  even  though  this  multiplication 
by  a  constant  does  not  change  the  solution.  This  performance  criterion 
then  places  the  greater  importance  on  the  absolute  magnitudes  of  the 
weight  vector  elements  and  a  far  lesser  importance  on  their  relative 

magnitudes.  However,  for  classification,  absolute  magnitudes  are  of  no 
importance,  since  the  solution  is  determined  by  the  relative  magnitudes 
together  with  the  signs  of  the  individual  elements. 

The  mean-square-error  training  procedure  requires  slightly 
more  than  twenty  seconds  to  process  100  samples.  This  means  the  complete 
training  process  took  over  thirty-three  minutes  of  computer  time  to 
complete.  On  the  other  hand,  the  probabilistic-descent  process  only 
calculates  a  change  in  W  when  a  sufficiently  large  error  in  f (X)  occurs. 
It,  therefore,  takes  less  time  to  do  the  100  samples^a  little  more  than 
fifteen  seconds.  However,  for  this  method  the  training  procedure  need 
only  be  repeated  50  times  to  obtain  the  solution  given  and  so  the  total 
time  required  will  only  be  about  thirteen  minutes.  This  time  can  be 
compared  with  one  minute  and  thirty  seconds,  the  time  given  by 
Patterson  and  Womack  for  their  solution.  Although  the  information  given 
in  the  paper  by  Patterson  and  Womack  is  not  very  complete,  their  results 
show  they  have  obtained  a  better  solution  in  far  less  computing  time 
than  we  have  been  able  to.  However,  because  the  probability  of  error 
produced  was  low  as  explained  in  section  4.2,  one  possible  explanation 
for  the  discrepancy  might  be  that  the  sample  set  used  by  these  authors 
was  a  poor  representation  of  a  Gaussian  distribution. 

By  comparing  Figure  4.1  with  Figure  4.4,  we  find  the  major 
difference  in  the  physical  structure  of  the  two  systems  to  be  the  amount 
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of  memory  required.  In  the  probabilistic-descent  classifier,  the  only 
memory  required  is  for  the  evaluation  of  the  matrix  E.  This  would  make 
this  scheme  more  practical  for  implementation. 

The  probabilistic-descent  training  process  has  one  quality 
which  is  inferior  to  the  mean-square-error  training  procedure  and  this 
is  the  fluctuation  of  the  convergence,  compared  to  the  smooth  conver¬ 
gence  of  the  mean-square-error  procedure.  In  the  probabilistic-descent 
method,  the  solution  moves  toward  the  optimal  as  an  average  but  each 
individual  change  in  the  weight  vector  is  not  guaranteed  to  improve  the 
solution.  This  disadvantage  is,  however,  compensated  by  the  increased 
convergence  rate.  A  summary  of  the  results  of  the  two  training  pro¬ 
cedures  is  given  in  Table  4.2. 
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CHAPTER  5  CONCLUSIONS 

5 . 1  Summary 

In  this  thesis,  we  have  tried  to  present  the  reader  with 
enough  knowledge  about  the  field  of  pattern  classification  to  allow  him 
to  pursue  the  topic  in  the  current  literature.  The  major  portions  of 
chapters  two  and  three  consist  of  material  collected  from  many  different 
sources,  most  of  which  are  listed  in  the  bibliography.  Some  parts, 
such  as  section  2.5  and  parts  of  section  3.2,  are,  however,  original 
ideas  which,  it  is  hoped,  will  add  to  the  ideas  already  available.  The 
work  of  chapter  four  has  been  carried  out  to  familiarize  the  author 
with  an  actual  classification  problem  and  to  check  the  validity  of  an 
existing  publication.  It  is  hoped  that  more  work  can  be  done  on  this 
problem  and  that  the  results  can  be  discussed  with  the  authors  of  that 
publication. 

5.2  Some  General  Comments 

Certain  aspects  of  pattern  classification  have  been  studied 
thoroughly  while  other  aspects  have  received  relatively  little  attention. 
For  example,  linear  classifiers  discussed  in  section  2.2  have  been 
studied  very  thoroughly  and  many  of  the  practical  examples  that  have 
been  produced  are  linear  machines  e.g.  ADALINE2 .  On  the  theoretical 
side,  Highlyman8  has  done  a  complete  Doctoral  Dissertation  on  linear 
decision  functions.  On  the  other  hand,  the  study  of  layered  machines 
has  not  progressed  very  far.  One  of  the  main  reasons  seems  to  be  the 
problem  of  training. 

In  the  area  of  training,  new  nonparametric  training  methods 
for  linear  machines  seem  to  be  appearing  continually  in  the  current 
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literature.  These  training  methods  can  also  be  applied  to  any  $  machine 
since  a  $  machine  is  linear  with  respect  to  its  weight  vector.  New 
methods  of  parametric  training  are  not  as  easy  to  produce.  The  methods 
already  available,  most  of  which  can  be  found  in  the  review  of  NagylH, 
are  generally  accepted  as  adequate  for  most  parametric  training  problems. 

5.3  Possible  Areas  for  Further  Research 

There  still  are  many  basic  problems  to  be  solved  before  the 
full  potential  of  learning  machines  can  be  realized.  One  such  problem 
is  the  formulation  of  a  transfer  function  for  a  linear  machine.  If 
this  could  be  done,  networks  of  these  machines  could  be  both  analyzed 
and  synthesized  with  much  less  difficulty  and  far  more  complex  classi¬ 
fiers  could  be  studied.  Another  problem  which  has  not  as  yet  been 
solved  is  the  training  of  more  complex  machines  such  as  layered  machines. 
This  problem  could  be  very  well  worth  studying  because,  as  was  shown  in 
this  thesis,  any  two  classes  which  are  separable  by  a  polynomial  func¬ 
tion,  no  matter  how  complex,  can  be  separated  by  a  layered  machine  pre¬ 
ceded  by  a  $  processor.  In  the  field  of  parametric  training,  the  basic 
training  problems  remaining  include  the  introduction  of  new  probability 
functions  and  methods  for  obtaining  the  parameters  for  them. 

The  future  of  this  study  of  pattern  recognition  by  adaptive 
pattern  classifiers  seems  to  lie  in  the  applications  of  the  study.  As 
yet,  the  theoretical  knowledge  has  far  out-run  the  experimental  successes. 
Other  areas  of  study  seem  to  be  picking  up  the  ideas  already  presented 
and  using  them.  One  of  the  best  examples  is  the  field  of  control 
systems  where  both  on-line  and  off-line  learning  techniques  are  receiving 
an  increased  amount  of  attention18.  This  research  is  leading  to  control 
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systems  known  as  self-organizing  systems.  Such  systems  are  defined  as 
systems  which  change  their  basic  structure  as  a  function  of  experience 
and/or  environment20.  Widrow4  has  even  mentioned  the  possibility  of 
self-repairing  systems.  It  would  take  little  imagination  to  realize  the 
significance  of  such  systems.  In  the  future,  as  more  of  these  applica¬ 
tions  are  developed,  the  demand  for  research  in  the  area  of  pattern 
classification  will  increase. 


.  ->2i  £..  L  :  o-  a  a:?  nv off*  smalt va 

o  iliac  *  a  no.  la  s  as  alulou  2  •  a  )i*  srfl  snario  rfoiriw  8JT3laTe 

.  *,;■.• 


ari3  :i  ^1  ol  i  i.iJt  aLliK  >.>  L  ruow  ll  .  a  la  a  g/ililac,  ai-SIae 


rfl  ni  ifoisaaai  10I  fenerasb  stij  tbaqolsv3b  axs  enoil 


/-*B):,s):onl  II Iw  nollaojtlip.BBlo 


70 


BIBLIOGRAPHY 


1.  Widrow,  B. :  "Generalization  and  Information  Storage  in  Networks  of 

Adaline  ’Neurons'";  Yovits,  Jacobi  and  Goldstein, 
"Self-organizing  Systems  -  1962";  Spartan  Books, 
Washington,  D.C.;  pp.  435-461;  1962. 

2.  Widrow,  B. :  "An  Adaptive  ADALINE  Neuron  Using  Chemical  Memistors"; 

Stanford  University  ERL  Tech.  Report  No.  1553-2; 

October  17,  1960. 

3.  Huber,  William  A. :  "Learning  Machine  Techniques  for  Pattern  Classi 

fication";  Proceedings  of  the  National  Electronics 
Conference;  vol.  21;  McCormick  Place,  Chicago,  Illinois 
pp.  517-522;  October  26,  27  and  28,  1965. 

4.  Widrow,  B. ,  G.F.  Groner,  M.J.C.  Hu,  F.W.  Smith,  D.F.  Specht  and 

L.R.  Talbert:  "Practical  Application  for  Adaptive 
Data-Processing  Systems";  Review  A;  Brussels  6,  Belgium 
January,  1968. 

5.  Darling,  Eugene  M. ,  and  R.  David  Joseph:  "Pattern  Recognition  from 

Satellite  Altitudes";  IEEE  Transactions  on  Systems 
Science  and  Cybernetics,  vol.  SSC-4,  No.  1,  pp.  38-47; 
March,  1968. 

6.  Feller,  William:  "An  Introduction  to  Probability  Theory  and  its 

Applications,  vol.  1";  2nd  Edition;  John  Wiley  and  Sons 
Inc.;  New  York,  N.Y.;  p.  36;  1957. 

7.  Nilsson,  N.J.:  "Learning  Machines:  Foundations  of  Trainable 

Pattern-Classifying  Systems";  McGraw-Hill  Book  Company; 
New  York,  1965. 

8.  Highleyman,  W.H.:  "Linear  Decision  Functions,  with  Application  to 

Pattern  Recognition";  Proc.  IRE,  vol.  50,  part  2 
pp.  1501-1514,  1962. 

9.  Anderson,  T.W. :  "An  Introduction  to  Multivariate  Statistics"; 

John  Wiley  and  Sons  Inc.,  New  York,  N.Y.;  1958. 

10.  Kashyap,  R.L.  and  Colin  C.  Blaydon:  "Estimation  of  Probability 

Density  and  Distribution  Functions";  IEEE  Transactions 
on  Information  Theory,  vol.  IT- 14  No.  4,  pp.  549-556; 
July,  1968. 

11.  Cooper,  P.W.:  "Multivariate  Extension  of  Univariate  Distributions" 

IEEE  Transactions  on  Electronic  Computers. 


Yb  .  Jt  'rQ] 


~.no*tusW  aallabA 

; "aioiffcimsM  l&oliaedO  g  ie'J  nojusli  V/J IJA<  A  sr/iiqebA  aA  :  .3  (wo7blW 
£  .o/;  J  oq  ?.  .  Idm  J91 3  viier  v-.''.  b:to  'flfilS 

•  d<?.  .  .1  .•-•  :  jO  , 

sdIhoi:  >e  3  ‘o  <  t  U-j-jo i.  3  ;''n©ii* oil 

;«  oo  .  ,  1(0  t  3  i  J>  j  Dim  :  ’  1  ;££  .  c  \r  ;-o;  loirc  ) 

■  I  tb£  r£.  ■  :  #dS  scj.j  '  -  -'■fi  .  /4  ■ 

baa  iffosqS  . 3.G  <ri->xai2  .W.f  ,uH  .D«L.K  t7sno70  .3.0  , .  I  ,woiblW 

sux^dlgT  .JI.J 

:  rti/1  ’  >a  t d  elseauxC:  ;A  v  v:>  :  a  iav3  .niesoocnd-slBG 

.8. IQ  ty*t>,unsL 

in  n'l  .  >i:;  ‘  r  -i on :  >1  it  sJ3.  h  :r  qeac  .  fcvrd  .9  br*.  ,  .M  sxrsguS  «gnlIus.(J 
■  5i  .  no  o  d  ,-i  II  ;  ‘iki.3  .  9Ji  Is  mc 
A-  ,  .  1  A -j  .  ill t#n  a.  yW-^xif;  a  n  '33 

,  bciGI  frioiBM 

£  j  i  Lib  v/'  ■  3  j  iT  y  d  cfc  I  c<1  n  :.i  lot  bo7.i  I  aA'  small  I  i  ,73 Ils- 
enoS  bns  yaliW  nriol  ;nox3ib3  bn£  ;;  'I  .lov  < arro±:J6DlIqqA 

S.L.M  «rioe8lill 
.  £d<?I  ,jJioY  w sV[ 

o7  noiixuxIqqA  ritJiw  t8noll3fiy,5  nolslotG  xesniJ" 

£  dxsq  t06  .Xov  t3HI  .ooxd  noi3xng039lI  mrsJJsfl 

i  ei-10cl  .qq 

:.W.T  tnoei3bnA 


.A 


71 


12.  Ahmed,  N.U.:  "A  Sequence  of  Probability  Measures  on  the  Euclidean 

n-space  and  its  Limit";  AECD/EL/6;  Dacca,  Pakistan; 
May,  1967. 

13.  Rosenblatt,  F. :  "Principles  of  NEURODYNAMICS";  Spartan  Books, 

Washington,  D.C.;  pp.  111-116;  1962. 

14.  Nagy,  George:  "Classification  Algorithms  in  Pattern  Recognition"; 

IEEE  Transactions  on  Audio  and  Electroacoustics, 
vol.  AU-16,  No.  2;  June,  1968. 

15.  Chien,  Y.T.:  "Simultaneous  Detection  and  Estimation  of  Pattern 

Characteristics  in  Self-Learning  Systems";  Presented 
to  the  6th  Annual  Allerton  Conference  on  Circuit  and 
System  Theory;  Allerton  House,  Monticello,  Illinois; 
October  2-4,  1968. 

16.  Patterson,  J.D.  and  B.F.  Womack:  "An  Adaptive  Pattern  Classifica¬ 

tion  System";  IEEE  Transactions  on  Systems  Science  and 
Cybernetics,  vol.  SSC-2,  No.  1;  pp.  62-67;  August, 
1966. 

17.  Amari,  Shunichi:  "A  Theory  of  Adaptive  Pattern  Classifiers"; 

IEEE  Transactions  on  Electronic  Computers,  vol.  EC- 16, 
No.  3;  June,  1967. 

18.  Mendel,  J.M.  and  J.J.  Zapalac:  "Realization  of  a  Suboptimal  Con¬ 

troller  by  Off-Line  Training  Techniques";  Preprints  of 
JACC  67  papers;  University  of  Pennsylvania;  pp.  258- 
266;  June  28-30,  1967. 

19.  Hamming,  R.W. :  "Numerical  Methods  for  Scientists  and  Engineers"; 

McGraw-Hill,  N.Y.;  pp.  34  and  389;  1962. 

20.  Mendel,  J.M.  and  J.J.  Zapalac:  "Advances  in  Control  Systems"; 

vol.  6;  1968,  Academic  Press,  New  York  and  London; 
chapter  1. 


K 

n6sb.fclD.u3  arfn  no  adiuassM  ^XXidBdoiS  Xq  3Diisups2  A':  :.U.I£  tb9mr!A  .SI 
;  non  elite 3  ,bodsG  ;d\J:i\CD3A  ;"i±mlJ  ani  bn  oosqa-n 

.  d9L  <xsM 

.Sd(M  ;dIi-IXi  ,qq  ;  .0.3  tnon&nxrfaBW 

t aoineuoosoTnosia  bne  olnuA  no  t  h  o8nai.i  3331 

.  8d£I  t9nuL  ; l  .  oW  *dl~ .A  . iov 

n^9J^63  lo  noinBirrinea  bns  aoinasnsG  auGa/.'&niumiS"  :.T.Y  ,/i9±riD  .d 
l>xjb  Jjtuoii  J  no  9onsnsino0  rtcnxeJ  CA  I^uanA  rind  srfn  on 

d  <  .  .  sii  L  ;£  .oil 

X 

••o  ;■  iJc::  ]  u  .i  d  3  -61 1  dn —  -110  n  .u  "oo  • 

-  !S  fq  ;6io  sjIv;  tr  niais/XfiCf  ia'sgaq  Sd  •  , 

.  rrl  tQi  8S  s  l  .  >$ 

!l  Of  i  n  is-  ?-  -  7^  .  ?nj  !  •:  bH  .  81 

. !'  n  i  '  :  ,  ;  .  Y  ,  K  ,  '  -WB  .  f 

;  c  j:  i-  noD  ni  a  om  :  du±j •  •,.*>.<  .L.  .  hu  ,  .M,t  t XsbnaM  .OS 

oo  iGu  bns  x’xoY  •  s  ,aa ;  d  -bf  nA  «•  91  ,  d  .Iov 


72 


APPENDIX  A 


CALCULATION  OF  THE  ERROR  PROBABILITY 
FOR  THE  BAYE'S  SOLUTION 


Given  a  Gaussian  probability  distribution  we  get: 

1  ,  r  -  . . t  -1 


/  Px(X)dX  = 


V2={X:f(X)<0} 


2  TT  I  £ 


i  1/2 


/  exp{-l/2(X-M)  2  (X-M)}dX  (A.l) 


V2={X:f (X)<0} 


With  M  = 


and  £ 


1  0 
0  4 


this  becomes 


+°°  2  io 
1  _X9  /8 

7 —  '  e 

4tt  ; 

X  =—oo 

2 


-4-oo  9 

-x^/2 

/  e  dx„ dx 


12 


(A. 2) 


x  =g(x  ) 


where  gCx^)  results  from  solving  the  discriminant  function  at  the  deci¬ 
sion  boundary  for  x^»  The  second  integral  in  A. 2  can  be  considered  as 


2  separate  integrals. 

■+"30  2  ; 

-x  /  2  , 

/  e  dX;L  =  /  e 


o°  2  > 

-x//2 


g(x2)  -x,2/2 


dx. 


/  e  1  dX;L 


(A.  3) 


X1=S(X2) 


xr° 


The  first  integral  on  the  right  hand  side  of  A. 3  can  be  solved  by  stan¬ 
dard  methods  (see  ref.  6,  p. 164-166).  This  gives 


00  2  /o 

-x.  /  2 

/  e  dx. 


(A. 4) 


o 


For  the  second  integral  let  u 

X 

du  = 

II 

t-o]  |l 

V  • 

this  gives 

g(x2)  -x  2/2 
/  e  1  dx. 

g(x2) 

u  — 

■  / 

^  A 

II 

o 

u=0 

n 

erf 

dx. 


/2 


~u 

e  du 


g(x2) 

'll  J 


(A. 5) 


a  xiowaqqA 


:d£>3  aw  noj  3  Jdii3c;ib  ydllJ:  adoiq  last  :D  &  neviO 


f.  T 


-  )8*  X 


-  d  ■  f  :j  ncilanr.  i  nl/al  a*:..  '  h  gj  1" '  *  ■  a  o  Jluesi  I  ..x)s 


.  .'  n<  oj-f .'.  d  ;  '  o  .  r  1  cico  ;  ril  .  .>.  to 

.  alsigsd 


ymbnuod  nola 


xb 


alaigsdnJt  siBifiqse  £ 
£j  \  xb* 


, ( ddl-Adl  q  «d  .1st  saa)  aboriJ9m  biab 


•  -  u  j  *I  i  tgsdoi:  bnooas  arid  io1? 
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where  we  use  the  definition 


X 


-u 


erf(X)  =  —  f  e  “  du 

/a 


we  therefore  have  the  result 

+oo  2 

-x//2 

/  e  dx^  = 

Xi=g(x2) 


{1  -  erf 


g(x2) 

/2 


Substituting  this  result  into  equation  A. 2  gives 


4/2tT 


+°°  -x  2/8 

/  e  2 


1  -  erf 


g(x2) 

7i 


dx. 


(A.  6) 


(A.  7) 


The  first  term  in  A. 7  can  be  determined  by  using  the  same 


method  applied  to  the  first  term  in  A. 3.  This  gives 

+C 

/ 


+0°  2/0 
-x  /8 

i  dx2  =  /8tT 


(A. 8) 


x  =— °° 

2 


The  second  term  in  A. 7  cannot  be  solved  analytically  and  so 
a  computer  integration  method  is  used.  From  equation  4.1  the  function 


g(x^)  can  be  determined. 


5  -  x. 


g(x2)  =  — ^ -  (A. 9) 

Substituting  this  into  the  second  term  in  A. 7  and  using  a 


computer  integration  gives 

+°°  2 


i 


V  S  —  oc 
2 


-X  /8 
e  erf 


(5  -  x2) 
4^2 


dx2  =  3.692 


(A. 10) 


We  therefore  get 

P (ERROR)  =  — —  -  3.692}  =  .1317  (A.  11) 

4^27 

For  the  second  integral  it  is  most  convenient  to  integrate  p-^(X)  over 
the  space  for  which  it  is  correctly  classified  and  subtract  this  from  1. 


I»mn±i»b  ad3  9W 


dlueai  arid  3VBrf  sioiaisrii  aw 


8\  x- 


.rid  nuiau  yd  baa  anrs dab  3d  ti&o  X.A  ai  anal  da"i±i  ©dT 


.£.A  ni  an  93  38  rid  arid  od  bollqqn  boridam 


>— -X 


os  bnB  YlIfiDid^IsfiB  bsvlofi  3d  donned  T.A  ni  ansd  bnoosa  artT 
naidonnd  arid  I. A  xro  dsupe  morii  .bsen  ai  borids.:  ncl*i  ;  JdnJ:  doduqxrroo  £ 

.b  rLnsdefc  dd  neo  (  x)g 


b  gnieu  bn£  ^  *A  ni:  misd  brroasa  aril  odni  airid  g  ri  dud  idaduS 


9  v  ;  r.  »J  i  -■  ’ja:J  r  i  -  adiiqm  o 


JS8  »XQlS«rii  9W 

£I£I  >  -  {£ea,£  -  rr8\  ; 
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The  decision  surface  for  this  case  is  a  parabola  symmetric  about  the 
x^-axis.  /.the  probability  of  error  is 


a  -X- 2/2  §2(xl)  -x  2/8 


P (ERROR)  =  1  -  /  ?1(X)  dX  =  1  -  /e1  / 


V1={X:f (X)>0} 


X_^=— oo 


x2=0 


86.18  -  64x. 


1/2 


where  g2(x^)  is  the  positive  root  of  x2  =  {- 


dx2dxl 


(A. 12) 


and  "a"  is  the  value  of  x^  for  which  x2  =  f(x^)  =  0 

86.18  ,  n/, 

1-e-  a  =  =  1-34/- 

Using  a  similar  method  to  that  used  in  A. 5 

g2(xl> 

g2(xl)  -x  2/8  U  “ 

/  e  dx  = 


x2=0 


/  ^  / 8  e*u2  du  =  ^7  erf 

u=0  & 


The  remaining  integral  is  solved  by  computer  methods 


Jr* 

2” 


x  =a=1.347  2 ,  N 

1  -x  / 2  g9(x  ) 

/  e  erf  {— — — }=  .8693 


y  00 

X1 


/8 


(A. 13) 


(A. 14) 


/.  P (ERROR)  =  .1307 

The  third  decision  boundary  is  a  hyperbola  symmetric  about 
both  the  x^  and  x2  axes.  We  therefore  need  only  integrate  the  distrib¬ 
ution  over  one  quadrant  and  multiply  by  4  to  get  the  complete  integral. 


P (ERROR)  = 


a 

e 


4tt 


-x  2/2  g3(xl)  -x  2/8 
e  j  e  dx„dx 


2  1 


(A. 16) 


y  oc 

xi 


x2=0 


The  discriminant  function  for  this  case  is 


22.18  -  15X;l2  +  12x22 


f(X)  = 


32 
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so 


15x„  -  22.18 


83(x1)  1  12 


•} 


and 


/22 , 18  ,  rr- — 

a  =  /  — —  =  ±/1.48 


From  (A,  13) 


83(xP  -x,2/8 


/ 


e  ^  dx^  =  /2tt”  erf  {— — — } 


83(X1} 


x2=0 


/8 


Again,  using  computer  methods  gives 


(A. 17) 


a  1.215  2.  /  \ 

1  /  ^  OO  '^1  ' 

e  erf  { - }  dx.  = 

tt  *  nr  1 


CL  X  i 

/F  / 


.09886 


x^=-°o 


/8 


(A. 18) 


and  so  P (ERROR)  =  .0989 


(A. 19) 


The  fourth  decision  boundary  is  an  ellipse  and  so  it  will  be 
much  easier  to  integrate  p^(X)  over  and  subtract  this  value  from  one. 
Again  quadrant  symmetry  allows  us  to  integrate  over  one  quadrant. 


P (ERROR)  =  1  - 


4tt 


3  -X  2/2  84<x1>  -x  2/8 

/e  1  /  e  2 


x^=0 


x2=0 


In  this  case  the  discriminant  function  is 


f(X)  = 


44.35  -  12X;l2  -  3x22 


32 


44.35  -  12x, 


1/2 


and  so 


^(xp  =  {■ 


dx2dx^  (A. 20) 


and 


a  = 


'44 , 35 


12 


=  /3 . 696  =  1.922 


84(^X1)  -x22/8 
Again  from  A.  13  /  e 


dx2  ” 


x  =0 


r- 

/2tt  erf  { - — } 

/8 


(A. 21) 
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From  the  computer 


x  =a=1.922 

/ 


erf 


x 


84(x1} 

{■- 

/8 


.84256 


(A. 22) 


so  P (ERROR)  =  1  -  .84256  =  .15744 


(A.  23) 
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APPENDIX  B 


CALCULATION  OF  THE  GRADIENT  OF 


THE  MEAN- SQUARE- ERROR 


We  would  like  to  solve  for  a  value  of  s  and  A  which  will 
produce  a  minimum  value  for  M(W  +  As) .  We  have  assumed  a  quadratic 
discriminant  function. 

?  2 

f (W,X)  =  w  +  w  x  +  w  x  x  +  w  x  +  w  x  +  w 
11  22  312  41  52  6 

so  we  can  write 

2  2 

f  (W  +  As  ,X)  =  (w  ^  +  As1)x1  +  (w2  +  As2^x2  +  ^w3  +  AS3')X1X2 


+  (w.  +  As  )x  +  (w  +  As  )x  +  w  +  As. 
4  41  5  526  6 


=  f(W,X)  +  f (As ,  ) 


(B .  1) 


We  can  then  expand  the  terms  in  M(W  +  As)  using  the  form  given  in  (B. 1) 
{f(W,X.)  +  f (As ,X  )  +  C}2  =  { f (W,X . )  +  C}2  +  {f(As,X.)>2 

l  i  l  l 


+  2f(W,Xi)  f(As,X±)  +  2f(As,Xj  C 
for  i  =  a  or  6  and  C  =  — C ( 2 / 1 )  for  a  and  C  =  +C(l/2)  for  3. 


(B.  2) 


qi  N1  2 

M(W  +  As)  =  M(W)  +  —  l  {  [f(As,  X  )]  +  2f  (As  ,X  )  [  f  (W,X  ) 

•nt  i  2  a  a 

N^a=l 

q2  v2 


-C(2/i)]}  +  _  l  {  [f(As,X  )]Z 
N2  8=1  3 

+  2f (As ,X  )  [f(W,X  )  +  C ( 1 / 2 )]  } 

3  3 


We  also  can  see  from  B,1 


f  (As ,X)  =  Af(s,X) 


so 


(B ,  3) 


q  N  q  i  in 

M(W  +  As)  =  M(W)  +  A2  —L  l  [f  (s,X  )]  +  _  x  2A  J1  f(s,X  ) 

N.  a=l  a  N,  a=l  a 


N  ^  a=  1 


fi 

l 

a=l 


|f(W,Xn)  -  C (2/ 1)1  +  A2  —  l2  [f(s,Xg)]2  + 


N2  8=1 


. 
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W  ,.H 
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Let 


q2  n 

+  —  x  2A  l  f(s,X  )  [f (W,X  )  +  C (1/2)] 
N  3=1  0  3 


(B.  4) 


G  (s)  = 


ql  ?1  r-  .  X-,  2  _  (1)  ,  v  ^2 


a 


N,  a 


z  &C.VJ  i  V  (s)  =  —  J, 

=1  N  3=1 

2 


2 


G  (2)(s)  =  —  r  f (s,X  )  [f(W,X  )  -  c(2/l)] 
N,  a=l  a  a 


«1 


So 


(o\  ^2  N2 

Gg(2)(s)  =  _  5^  f(S>Xg)[f(W,X6)  +  C  ( 1  /  2 )] 


M(W  +  As)  +  M(W)  +  A2  [Ga(1)(s)  +  G  (1)  (s)]  +  2x[Ga(2)(s)  + 


3 

+  Gg<2)(s)] 


(B.5) 


dM(W  +  As) 


dA 


=  2A[Ga(U(s)  +  Gga)(s)]  +  2[Ga(2)(s)  + 

+  Gg(2>(s)]  (B.6) 


Setting  (B.6)  to  zero  gives  a  minimum  value  to  M(W  +  As).  So  the  value 


of  A  which  minimizes  M(W  +  As)  is  given  by 


Ga(2)(s)  +  Gr(2>(s)  G(2>(s) 


A  =  - 


Ga(1hs)  +  Gb(1)(s)  G(n(s) 


(B.7) 


where 


G(i)(s)  =  G, 


a 


(i)(s)  +  Gg(i)(s)  for  i  =  1,2 


This  value  gives  a  true  minimum  because  G^^(s)  -  0  which  means 
M(W  +  As)  is  a  convex  function  of  A. 


dM 

dA 


is  the  gradient  of  M(W)  at  W.  So  to  find  s  such  that  the 
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(X) 
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‘  3  +  <a)(1>  -  (a)  a 
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gradient  is  maximum  we  must  take 


dM 

max 

dX 


=  max  2G^  (s) 


X=0 


G(2)(s)  =  —  ^  f(s,X  )  [f(W,X0)  -  C ( 2 / 1 )]  + 

a=l 

92  ^2 

—  2  f(s,Xg)  [f(w,x6)  +  C(l/2)1 
n2  g-i 

We  can  define 


f(s,X)  =  (s • F (X) ) 


where  F(X)  is  the  $  function  of  the  pattern  vector. 

F (X)  =  (Xj2,  X22,  X^2,  Xx,  X2,  1) 

We  can  further  define  (s*F(X))  x  (W"F(X))  =  (As*W) 
by  defining  A  as  the  outerproduct  of  F(X)  with  itself. 


x  4 

X1 
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2  2 
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X  4 
x2 
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x3 

x2 
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A  = 

3 

X1  x2 

3 

xlx2 

2  2 
xrx2 

x12x2 

2 

xlx2 

Xi* 

V 

2 

X1X2 

xl2x2 

„  2 

X1 

xlx2 

X1 

x12x2 

x23 

xix22 

xlx2 

x  2 
x2 

x2 
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x22 

xlx2 

X1 

x2 

1 

G(2)(s)  =—  ^1[(A(Xct)s.W)  -  (s*F(Xa))  x  C ( 2 / 1 )] 

a=l 

cjo  No  _ 

+  —  l  [(A(X  )s-W)  +(s-F(X  ))x  C(1/2)J  (B.  8) 

N2  6=1 

(A(X)s'W)  =  (s’At(X)W) 
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so 


G 


(2) 


(s) 


qi 

—  I  (s " (At (X  )W  -  F (X  )  x  C (2/ 1) }) 

_  _  _  uc  Ot 


q2  N2 

+ —  l  (s*{At(Xfl)W  +  F(X  )  x  C(l/2) }) 
,T  „  1  P  P 


ql  rl  t  qi  rl 

=  (s.{  —  1  A  (xa)  W - l  F(Xa)  X  C (2/ 1) 

a= 1  a=l 


<12  S2  t  ^2  52 

+  —  Z  Ac(Xo)  W  +  —  I  F(X  ) 
N  3=1  N2  3=1 


C ( 1 / 2 ) }) 

(B.  9) 


Now,  by  Schwartz  inequality,  maximum  G 


(2) 


(s)  will  be  given  by 


s 


ql 


r  ni 

(  l  At(Xa))  W 
a=l 


N1 

l  F(Xa)  x  C (2/ 1) 


r-  n2  n2 

(  v  At(Xg))  W  +  l  F(Xg)  X  C ( 1 / 2 ) 
3=1  3=1 


(B. 10) 
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APPENDIX  C 


CALCULATION  OF  THE  ERROR  PROBABILITY 


FOR  THE  MEAN- SQUARE-ERROR  SOLUTION 


X1  a2  9  f2^Xl^ 


P  =  qj{l  - 


1  w 

—  f  e  ^  1 

j 

4tt  xi=+a1 


/ 

x2~f 1 (xi) 


rx22/8 


dx^dx^ }  + 


a2  f2(xi} 

+  q?U  -  —  /  e-(xi-2)2/2  j  o-(x2~2)2/8 


4tt  Xj-aj 


The  decision  boundary  is  defined  by 


x2=f x (xl> 


dx2dxi 


(C.l) 


-.2253x12  -  ,03314x22  -  .06276x^2  +  .2987xx  +  .09026x2  +  .8023  =  0 


(C.  2) 


Solving  for  X2  gives 


x2  =  .06276Xl  -  .09026  ±  (h(x  ))“*  /  2(-. 03314)  (C.3) 


where 


h(Xl)=  (,06276Xl  -  . 09026) 2  +  4 (. 03314) (. 8023  +  .2987xt 


-  .2253x1/) 

Taking  the  negative  square  root  will  give  f  (x^)  and  taking  the  positive 
square  root  will  give  f2(x^). 

To  find  a^  and  a2  we  must  solve  the  square  root  term  for  the  value  of 
x-j^  which  gives  0. 

0  =  (.06276xjL  -  O09026)2  +  4 (. 03314)  (. 8023  +  .2987Xl  -  ,2253Xl2) 
0  =  -.02592695x12  +  .02826623xx  +  .11339876  (C.4) 

which  yields  x^  =  -1.626  and  2.716  so  a^  =  -1.626  and  a2  =  2.716 
To  solve  the  integral 

f2(Xl)  2 

j  e~x2  dx2 
x2=f1(x1) 
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we  take 
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j  e-*22/8 
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dx„  = 


x. 


rfi<xi) 


0 

/  e  -  ax2  -r  j  e  --z  -  -dx2 
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/s7 
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/8tt 
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2 
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P1  qiU 


_  ^  Q 
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-  f  e~"2Xr 


4ttx2  a 


1 


erf 


/8~ 


-  erf 


fl(xi> 


/8~ 


dx1J 


(C.7) 


Doing  the  integration  by  computer  gives 

/8tt 

P  =  q , ( 1  -  -  (4.2686)}  =  .5(1  -  .8515}  =  .0743  (C.8) 

1  8rr 

In  a  similar  manner  P2  can  be  found. 
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(x^)-2 

/  /r 


du 
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(C .  9) 
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)  dx^  }  (C.10) 


Again  the  integration  had  to  be  done  by  computer  methods  giving 
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